AD-A188 934    UNCLASSIFIED    DEC 87    AFIT/GE/ENG/87D-14



SPIRE  BASED  SPEAKER-INDEPENDENT 
CONTINUOUS  SPEECH  RECOGNITION 
USING  MIXED  FEATURE  SETS 

THESIS 

Robert  G.  Dawson,  Capt,  USAF 
AFIT/GE/ENG/87D-14 


DEPARTMENT  OF  THE  AIR  FORCE 
AIR  UNIVERSITY 




AIR  FORCE  INSTITUTE  OF  TECHNOLOGY 


Wright-Patterson  Air  Force  Base,  Ohio 




SPIRE  BASED  SPEAKER-INDEPENDENT 


CONTINUOUS  SPEECH  RECOGNITION 


USING  MIXED  FEATURE  SETS 


THESIS 


Presented  to  the  Faculty  of  the  School  of  Engineering 
of  the  Air  Force  Institute  of  Technology 
Air  University 

In  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of 
Master  of  Science  in  Electrical  Engineering 


Robert  G.  Dawson,  Capt,  USAF 


December  1987 




Approved  for  public  release:  distribution  unlimited 




Acknowledgements 

This work is dedicated to those I love: my parents, Lt Col and Mrs.
William R. Dawson, whose example made this work possible; and my wife,
Monica, whose love made it worthwhile.

Also, special thanks to my thesis advisor, Dr. Matthew Kabrisky,
whose special combination of inspiration, knowledge, and humor made this
work actually fun.


Table of Contents

Acknowledgements
List of Figures
List of Tables
Abstract

I.  Introduction
    Background
    Definitions
    Templates and Features
    Dynamic Programming
    Connected Speech
    Speaker Independence
    Problem
    Scope
    Approach
    Sequence of Presentation

II.  Acoustic Processing Environment
    Introduction
    Lisp
    SPIRE
    Overview
    Interfacing SPIRE
    Hardware
    LISP Machine
    Array Processor
    Speech Digitizer
    Summary of Equipment

III.  System Design
    Introduction
    Utterance Processing
    Feature Extraction
    Wide-Band and Narrow-Band Spectrum
    LPC Spectrum
    Formants
    Frication Frequency
    Additional Processing
    Clipping
    Median Filtering
    Frequency Compression
    Energy Normalization
    Ready-Utterance
    Ready-Template
    Dynamic Programming Algorithm
    Introduction
    Distance Arrays
    One-Stage Algorithm for Connected Word Recognition
    Time Distortion Penalties
    Summary of Steps
    Storage

IV.  Results and Discussion
    Introduction
    Distance Array Contour
    Wide-Band Spectrum
    Narrow-Band Spectrum
    LPC Coefficients
    LPC Spectrum
    Formants
    Frication Frequency
    Zero Crossing Rate
    Recognition of Connected Speech
    Speaker-Dependent Results
    Wide-Band Spectrum
    Narrow-Band Spectrum
    Formants
    LPC Spectrum
    LPC Spectrum, Formants, Frication Frequency
    Speaker-Independent Results
    Single-Speaker Template Sets
    Multi-Speaker Template Sets
    Overall Results Using LPC Spectrum
    Overall Results Using LPC Spectrum, Formants, and Frication Frequency

V.  Conclusions and Recommendations
    Introduction
    Conclusions
    Recommendations
    Environmental Stress
    Tailored Template Sets
    Additional Features
    Syntactic Rules
    Dedicated Hardware
    Summary


List of Figures


2.1  Original Waveform
2.2  Overlaid Displays
2.3  Synchronized Displays
2.4  Standard SPIRE Displays
2.5  SPIRE Interface Functions
2.6  SPIRE Result Arrays
3.1  Spectral Slices
3.2  Formants  Frication Frequency
3.3  Hypothetical Distance Array
3.4  Hypothetical Accumulated Distance Array
3.5  Time Distortion Penalties
3.6  Hypothetical Distance Array for Continuous Speech
3.7  Transition Rules
3.8  Backpointers From Three Preceding Grid Points
3.9  The Backtracking Procedure
3.10  Schematic Diagram
4.1  Distance Array Contour, Wide-Band Spectrum
4.2  Distance Array Contour, Narrow-Band Spectrum
4.3  Distance Array Contour, LPC Coefficients
4.4  Distance Array Contour, LPC Spectrum
4.5  Distance Array Contour, Formants
4.6  Distance Array Contour, Frication Frequency
4.7  Distance Array Contour, Zero Crossing Rate


List of Tables


4.1  Speaker Dependent Feature Results
4.2  Single Template Speaker Independent Results
4.3  Multi-Template Speaker Independent Results
4.4  Overall Results, LPC Spectrum
4.5  Overall Results, LPC Spectrum, Formants, Frication Frequency




Abstract 


A system was developed to investigate continuous speech
recognition. The system incorporates multiple features and dynamic
programming to recognize continuous inputs of the spoken digits (zero
through nine). The fundamental design concept extends from previous
successful recognition research efforts involving both isolated and
continuous speech using multiple feature sets, multiple template sets,
and dynamic programming. Among the features used in the investigation
are wide band spectrogram, narrow band spectrogram, linear predictive
coding (LPC) coefficients, LPC spectrum, frication frequency, and
formant tracks. An advanced speech research tool called SPIRE provided
the computational functions needed to extract the raw features. The
system is implemented in LISP on a Symbolics 3600 series LISP machine.


SPIRE  BASED  SPEAKER-INDEPENDENT 
CONTINUOUS  SPEECH  RECOGNITION 
USING  MIXED  FEATURE  SETS 


I.  Introduction 


The ability to communicate using language is considered one of the
hallmarks of the human race. As machines attempt to do more and more of
what humans do, it becomes necessary for them to also have the ability
to use language. Before this can be achieved, machines must be able
to accurately identify spoken words without the aid of syntax or
semantics. Even simple word recognition by computer would have many
potential applications, from voice-controlled television sets to
voice-activated displays in the cockpit of an F-16 fighter aircraft.
Ultimately, speech recognition is seen as essential to the total
automation of the human-machine interface, including automated
dictation, language translation, and artificial intelligence devices.
Although word recognition has improved steadily over the past decade,
general speech recognition devices still do not exist. Therefore, much
research will be necessary before such general speech recognition
devices can be perfected.


Background 

Speech recognition is a relatively new field of study made possible
only by recent advances in computer and digital signal processing
technology. Today, a variety of speech recognition systems are
commercially available, ranging from expensive stand-alone units to
plug-in boards for personal computers. However, many practical problems
must still be solved if speech recognizing devices are to find their
way into common use. Most current research is focused upon finding
better ways to represent speech and, once it is represented, better
ways of handling the variations of speech patterns typical of a diverse
population. Only if these problem areas are solved will the more
complex problems of connected speech be solved.

Although speech recognizers have improved steadily over the past
few years, they are still hampered by certain serious performance
problems. The most intractable of these is inaccuracy. There are many
systems available claiming as high as 95% accuracy, meaning the system
can correctly identify spoken words 95% of the time. However, in real
environments these claims are more hopeful than true (14:200).

Current voice recognizers suffer from many operational deficiencies
resulting from inaccuracy. Almost all systems available today require
extensive user training. The user must train the system to recognize
his voice and, therefore, the system is speaker-dependent. There
always seems to be a certain percentage of the population for whom the
system performs very poorly. Some machines work better with male voices
than with female voices or vice versa. Current speech recognizers
perform poorly in the presence of background noise such as office or
factory noise. Further, most current speech recognizers can only
recognize purposely isolated words rather than more natural continuous
speech. Finally, current speech recognizers lack large enough
vocabularies to be useful for many applications (14:200).

Specific  problem  areas  in  speech  recognition  include  selection  of 
good  feature  sets  and  template  sets,  the  problem  of  connected  speech, 
and  the  problem  of  speaker-independent  speech  recognition.  These  terms 
will  be  discussed  fully  in  the  following  paragraphs. 

Definitions 

Templates  and  Features.  A  template  is  a  set  of  data  that  represents 
each  word  of  the  given  vocabulary.  The  fundamental  task  of  speech 
recognition  is  matching  the  spoken  word,  called  an  utterance,  to  a  set 
of  stored  templates  and  deciding  which  template  the  utterance 
represents.  There  is  typically  a  template  for  each  word  in  the 
vocabulary.  Templates  are  created  through  a  process  known  as  "training" 
in  which  the  user  repeats  the  vocabulary  into  the  recognizer.  Features 
are then extracted and stored as templates (12:489).  Herein lies the
basic  challenge  of  speaker-independent  speech  recognition,  that  is 
"creating  a  set  of  templates  that  can  be  used  reliably  with  many 
different  speakers"  (6). 

A  template  set  that  can  be  used  reliably  with  many  different 
speakers  must  contain  those  features  that  best  represent  each  word  of 
the  vocabulary.  Exactly  what  features  carry  the  essential  information 
that  distinguishes  a  word  from  the  rest  is  not  completely  clear. 
Brusueles (2) investigated this problem by extracting 55 different
features, grouped in six general categories, for each word in a 13 word
vocabulary. The six general categories were:

(1)  Wide-band Spectrogram.  Graphic depiction of frequency content.

(2)  Zero Crossing Rate.  A count of the number of times the
waveform passes through a region centered around zero.

(3)  LPC gain and

(4)  LPC coefficients.  Coefficients and gain terms used in a
speech coding technique called Linear Predictive Coding (LPC) in which
the speech production process is modeled rather than the waveform
itself. This is done by an adaptive filtering process in which the
filter coefficients are calculated so as to simulate the vocal tract,
which itself is a filter. These coefficients are commonly referred to
as linear predictive coding (LPC) coefficients. These LPC coefficients
can be used to reproduce a copy of the original speech waveform. Doing
so allows voice to be digitally transmitted at a very low bit rate and
thus a very small bandwidth or storage capacity. Because of this data
reduction property, LPC coefficients are often used for speech
recognition (4:26).

(5)  Formants.  Resonant frequencies of the vocal tract.

(6)  Time.  Time over which utterance is spoken.
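As an illustration of item (2), the sketch below counts zero crossings while ignoring samples that fall inside a small region centered on zero. It is a minimal Python sketch, not the computation SPIRE or Brusueles used; the function name, the `dead_band` parameter, and the sample values are assumptions made for the example.

```python
def zero_crossing_count(samples, dead_band=0.0):
    """Count sign changes, ignoring samples within +/- dead_band of zero."""
    count = 0
    last_sign = 0  # sign of the most recent sample outside the dead band
    for x in samples:
        if x > dead_band:
            sign = 1
        elif x < -dead_band:
            sign = -1
        else:
            continue  # sample lies inside the region centered around zero
        if last_sign != 0 and sign != last_sign:
            count += 1
        last_sign = sign
    return count


# Hypothetical low-level noise frame: every sample changes sign.
noise = [0.01, -0.01, 0.01, -0.01]
print(zero_crossing_count(noise))                  # no dead band
print(zero_crossing_count(noise, dead_band=0.05))  # noise is ignored
```

The dead band keeps low-level noise from inflating the count, which is presumably why the definition refers to a region around zero rather than to zero itself.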

While certain of these 55 different features proved useful during
word-template matching, others were less so. Further, their usefulness
changed with different speakers and/or words. Brusueles suggests that
it may be possible to use some sort of statistical weighting of features
to improve the word-to-template matching algorithms (2:68-69).


"Vector quantization" is another coding technique used in speech
recognition. In this method, the different sounds that are produced by
the vocal tract are represented by individual numbered codes. Then
words are represented by vectors made up of these codes. For example,
the word "Bill" could be represented by the vector <4, 32, 32, 32, 32,
20, 20>, where 4, 32, and 20 represent the sounds "b", "i", and "l"
respectively. In their work Burton, Shore, and Buck applied "vector
quantization" and achieved a recognition accuracy of 98% for speaker-
independent recognition of isolated digits, "zero" through "nine"
(3:837). Vector quantization is a relatively new speech encoding
technique.
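The coding step can be sketched as a nearest-neighbor lookup: each frame of features is replaced by the code of the closest codebook entry. The Python below is an illustrative sketch only. The two-dimensional codebook and the frame values are invented so that the output mirrors the "Bill" example; they are not taken from Burton, Shore, and Buck.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(frames, codebook):
    """Replace each feature vector by the code of its nearest codebook entry."""
    return [
        min(codebook, key=lambda code: squared_distance(frame, codebook[code]))
        for frame in frames
    ]


# Hypothetical codebook: code number -> representative feature vector.
codebook = {4: (0.0, 1.0), 32: (1.0, 0.0), 20: (1.0, 1.0)}

# Hypothetical feature frames for one utterance.
frames = [(0.1, 0.9), (0.9, 0.1), (1.1, -0.1), (0.8, 0.0),
          (1.0, 0.2), (0.9, 1.1), (1.0, 0.8)]

print(quantize(frames, codebook))  # [4, 32, 32, 32, 32, 20, 20]
```

A real codebook would be trained from speech data and would hold many more entries; the lookup itself stays the same.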

Speech can also be modeled as a Markov chain in which the current
signal state is somewhat dependent on the previous signal state. For
example, vowel sounds are more likely to follow consonant sounds.
Signal modeling based on "hidden Markov models" (HMM) may be viewed as
"a technique that extends conventional stationary spectral analysis
principles to the analysis of time varying signals." Juang and Rabiner
investigated two types of Markov models. One was based on finite
mixtures of Gaussian autoregressive densities (GAM), and the other was
based on nearest-neighbor partitioned finite mixtures of Gaussian
autoregressive densities (PGAM). Juang and Rabiner determined that GAM
and PGAM models have applicability to speaker independent digit
recognizers (5:1404,1412).
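The Markov chain idea can be made concrete with a toy example. The Python sketch below scores a sequence of coarse sound classes under a first-order Markov chain. It is an ordinary (visible) Markov chain, not the hidden Markov models of Juang and Rabiner, and every probability in it is an assumed value chosen only to reflect the statement that vowels tend to follow consonants.

```python
import math

# Hypothetical probabilities; each row of TRANSITIONS sums to 1.
INITIAL = {"consonant": 0.5, "vowel": 0.5}
TRANSITIONS = {
    "consonant": {"consonant": 0.3, "vowel": 0.7},
    "vowel": {"consonant": 0.6, "vowel": 0.4},
}


def sequence_log_probability(states):
    """Log probability of a state sequence under the first-order chain."""
    log_p = math.log(INITIAL[states[0]])
    for previous, current in zip(states, states[1:]):
        log_p += math.log(TRANSITIONS[previous][current])
    return log_p


alternating = ["consonant", "vowel", "consonant"]
clustered = ["consonant", "consonant", "consonant"]
print(math.exp(sequence_log_probability(alternating)))
print(math.exp(sequence_log_probability(clustered)))
```

Under these assumed numbers the alternating sequence is the more probable of the two, which is the kind of dependence on the previous state that the paragraph describes.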

A fundamental goal in speaker-independent speech recognition is
that of creating a set of reference data or templates that can be used
reliably with many different speakers. One method that has been used is
template averaging. This method is accomplished by recording
as many as 10 tokens (examples) of each word in the particular
vocabulary and averaging them into one template set (14:32). Brusueles
found that a template made by averaging two males and one female
performed better than a template made by averaging three males in a test
population of seven males and three females, suggesting that a wide
variation of reference patterns may be more effective than a narrow
variation (2:28).
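A minimal Python sketch of template averaging follows. The source does not say how tokens of different lengths were aligned before averaging, so the sketch assumes a simple index-based time normalization to a common number of frames; the function names and sample values are hypothetical.

```python
def resample(token, length):
    """Pick `length` frames from a token at evenly spaced positions."""
    n = len(token)
    return [token[(i * n) // length] for i in range(length)]


def average_template(tokens, length):
    """Average several tokens of one word, frame by frame, into one template."""
    resampled = [resample(token, length) for token in tokens]
    template = []
    for i in range(length):
        frames = [r[i] for r in resampled]
        dims = len(frames[0])
        template.append(
            [sum(frame[d] for frame in frames) / len(frames) for d in range(dims)]
        )
    return template


# Two hypothetical tokens of the same word, each a list of 1-D feature frames.
token_a = [[1.0], [3.0]]
token_b = [[3.0], [3.0], [5.0], [5.0]]
print(average_template([token_a, token_b], length=2))  # [[2.0], [4.0]]
```

Averaging smooths out differences between speakers, which fits the observation that a template built from a mixed group of speakers performed better than one built from a narrow group.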

Multiple template sets have also been used to support speaker-
independent speech recognition (14:29). In this case the tokens are
stored individually and compared individually as though they were
separate words in the vocabulary. One technique involved "clustering"
100 repetitions of each word in a 39 word vocabulary. The vocabulary
consisted of the 26 letters of the alphabet, the 10 digits, and three
command words (STOP, ERROR, and REPEAT). Average recognition accuracies
of close to 97 percent were obtained on 38 of the 40 talkers (13:583).
This method has the advantage of being able to represent a wider range
of the population; however, an obvious disadvantage is the increased
computation required to perform necessarily more comparisons than for a
vocabulary represented by single templates (10:263).
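The trade-off can be seen in a short Python sketch: with several stored tokens per word, the recognizer compares the utterance against every token and reports the word of the closest one, so the number of comparisons grows with the number of tokens. The distance function below is a deliberately simple placeholder for equal-length sequences, and all names and data are hypothetical.

```python
def frame_distance(a, b):
    """Placeholder distance: sum of absolute differences, equal lengths assumed."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(utterance, template_sets, distance):
    """Return (best word, number of comparisons) over all stored tokens."""
    best_word, best_score, comparisons = None, float("inf"), 0
    for word, tokens in template_sets.items():
        for token in tokens:
            comparisons += 1
            score = distance(utterance, token)
            if score < best_score:
                best_word, best_score = word, score
    return best_word, comparisons


# Two hypothetical tokens per word, each a short 1-D feature sequence.
template_sets = {
    "zero": [[1, 2, 3], [1, 2, 4]],
    "one": [[5, 5, 5], [6, 5, 4]],
}
print(classify([1, 2, 4], template_sets, frame_distance))  # ('zero', 4)
```

With one template per word the same utterance would need two comparisons instead of four, which is the added cost the paragraph refers to.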

Dynamic Programming.  Regardless of the method chosen to accomplish
word-to-template matching, it is usually necessary to establish optimum
time alignment between the input and reference speech data.

Originally, advanced research in speech recognition employed
relatively simple techniques to partition a speech signal into
separate units, then very complex methods to classify the
segments and recover from segmentation errors. It was soon
realized that signals could not be reliably segmented without
prior knowledge of the acoustic sound class. In the early
1970's a technique called "dynamic programming" was
introduced. Dynamic programming improves the segmentation
process by hypothesizing acoustic events and testing each
hypothesis at an acoustic level (4:26).

A one-stage dynamic programming algorithm for connected speech has been
proposed by Ney (10). This algorithm was actually proposed as far back
as 1971 by T. K. Vintsyuk. Ney states that:

An  advantage  is  that  the  three  operations  of  word  boundary 
detection,  nonlinear  time  alignment,  and  recognition  are 
performed  simultaneously:  thus,  recognition  errors  due  to 
errors  in  word  boundary  detection  or  to  time  alignment  errors 
are  not  possible. 

Dynamic  programming  is  considered  one  possible  springboard  for  future 
advances  in  speech  recognition  (4:26). 
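To illustrate how dynamic programming establishes a time alignment, the Python sketch below implements textbook dynamic time warping between one input sequence and one template. It is not Ney's one-stage connected-word algorithm and it omits time distortion penalties; the one-dimensional features and the three allowed predecessor steps are simplifying assumptions.

```python
def dtw_distance(utterance, template):
    """Accumulated distance along the best time alignment of two sequences."""
    n, m = len(utterance), len(template)
    inf = float("inf")
    # acc[i][j]: cost of the best path aligning the first i input frames
    # with the first j template frames.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(utterance[i - 1] - template[j - 1])
            acc[i][j] = local + min(
                acc[i - 1][j],      # input frame repeated against the template
                acc[i][j - 1],      # template frame repeated against the input
                acc[i - 1][j - 1],  # both advance together
            )
    return acc[n][m]


# The second sequence is the first spoken "more slowly" (a frame repeated).
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_distance([1, 2, 3], [3, 2, 1]))
```

Because the alignment may stretch or compress time, two renditions of the same pattern at different speaking rates can still match with zero accumulated distance.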

Connected  Speech.  Connected  speech,  as  opposed  to  isolated  speech, 
presents  additional  unsolved  problems.  Although  some  speech  recognizers 
can,  to  a  limited  extent,  recognize  words  without  pauses  between  the 
words,  they  are  less  accurate  and  more  expensive.  Connected  speech  is 
difficult  for  a  number  of  reasons.  One  problem  is  that  of  detecting 
word  boundaries.  Although  some  techniques  don't  require  word  boundary 
detection,  these  techniques  pay  a  penalty  in  terms  of  much  more  intense 
data  comparison  requirements.  The  real  difficulty  of  connected  speech 
recognition  stems  from  the  fact  that  acoustic  variation  of  words  spoken 
in  connected  speech  is  much  greater  than  when  the  words  are  spoken  in 
isolation.  This  is  due  to  the  "coarticulation"  of  neighboring  sounds. 
The positions of the tongue, jaw, and lips in one speech sound are
affected by their previous and future positions. Further, the time
variation of words is more severe for continuous speech than for
isolated speech (4:27).


Speaker Independence.  It is obvious that humans can recognize the
speech  of  a  variety  of  speakers  without  the  need  of  any  training 
process.  Somehow,  the  brain  is  able  to  extract  the  key  features  of 
speech,  determining  what  is  being  said  even  though  different  speakers 
may  say  the  same  thing  differently.  Current  word  recognition  systems 
simply  lack  this  "robustness"  that  is  necessary  for  most  applications. 
One  approach  to  improving  "robustness"  and  thus  accuracy  lies  in 
developing  systems  that  are  speaker-independent  (6). 

Problem 

The primary purpose of this thesis was to develop a system for
connected speech recognition and to examine the usefulness of
multiple templates, multiple features, and dynamic programming. The
system has been implemented on a Symbolics 3600 Series Lisp Machine
using an advanced speech analysis tool called SPIRE (Speech and
Phonetics Interactive Research Environment).

Scope 

The recognition system developed is based on recognition of
continuously spoken digits, zero through nine. The small vocabulary was
necessary due to limited time and disk space; however, recognition of
the digits provides a sufficient challenge for the purpose of this
research. This research has investigated what features are best suited
for speech recognition and how best to combine them. The dynamic
programming algorithm used is identical to the one-stage dynamic
programming algorithm proposed by Ney (10). Other programming techniques
have not been directly addressed. For the most part, SPIRE is used as
a library of functions called by the main LISP program, although SPIRE
can be used in an interactive mode as well.


Approach 

The approach is outlined as follows. First, individual feature
performance will be observed by plotting "distance array contours".
Next, a continuous speech dynamic time warping algorithm will be used to
further study features. Finally, speaker-independent continuous speech
will be studied.

Sequence  of  Presentation 

Chapter  two  presents  the  acoustic  processing  environment.  In 
particular,  the  chapter  introduces  SPIRE,  a  powerful  speech  analysis 
research  tool,  as  well  as  the  Symbolics  3600  Series  Lisp  Machine. 

Chapter three defines the system design including the basic
algorithms developed. A basic explanation of the dynamic time warping
algorithm is included here.

Chapter  four  presents  the  results.  First,  individual  feature 
performance  is  investigated  to  see  which  feature  sets  are  best  suited  to 
speech  recognition  and  optimal  ways  of  combining  multiple  feature  sets 
to  increase  performance.  The  best  of  these  are  then  tested  for 
speaker-independent  performance.  Last,  multiple  template  sets  are  used 
in  an  effort  to  improve  speaker-independent  performance. 

Chapter five provides conclusions and recommendations, and
appendices present additional results, program description, and
listings.




II.  Acoustic Processing Environment


Introduction 

The  purpose  of  this  chapter  is  to  introduce  the  software  and 
hardware  components  used  to  develop  the  recognition  system.  The  chapter 
is  divided  into  three  sections.  The  first  section  describes  the  Lisp 
programming  language.  The  next  section  describes  SPIRE,  an  advanced 
speech  analysis  tool  that  provides  many  of  the  computational  functions 
used  for  feature  extraction  and  general  speech  processing.  The  last 
section  describes  the  hardware  configuration  that  is  used  as  well  as 
other  optional  hardware. 

Lisp 

Lisp is a high-level programming language that takes its name from
List Processing. Lisp, one of the oldest active programming languages,
is widely used in the field of artificial intelligence (11). Lisp is an
extremely powerful language for handling the large amounts of data
common in artificial intelligence applications. In fact, special-purpose
computers called Lisp Machines are designed at the circuit level
especially for running Lisp. Together, the language and the machines
provide a powerful computing environment with a large virtual address
space that makes it "particularly attractive for speech and signal
processing applications" (2:5).

There are many dialects of Lisp; however, one dialect called Common
Lisp seems to be emerging as a standard. Most Lisp machines now in
production have Common Lisp as a standard feature. Older machines may
use different dialects. The Symbolics 3600 Series Lisp Machine used for
this work uses a dialect called Zeta Lisp.


SPIRE 


Overview  (2:6-12).  SPIRE  stands  for  Speech  and  Phonetics 
Interactive  Research  Environment.  [SPIRE  is  available  by  license 
through  the  MIT  Patent  Office.]  It  is  a  software  program  that  allows 
the  user  to  interactively  examine  and  process  speech  and  other  audio 
signals.  The following paragraphs provide an overview of SPIRE's design
philosophy, graphical capabilities, implementation considerations, and
documentation.

SPIRE was designed to be easy enough to use for the novice,
yet powerful enough for even the most advanced users. In the
interactive mode, SPIRE takes full advantage of the Lisp Machine's
built-in graphical interface for quick and easy research. For the more
advanced  user,  SPIRE  allows  relatively  painless  customization  and 
modification.  The  interactive  mode  is  very  useful  for  learning  about 
the  various  attributes  of  speech.  The  next  paragraph  describes  some  of 
SPIRE's  more  common  capabilities.  For  more  detailed  information 
concerning  the  use  of  SPIRE,  the  reader  is  referred  to  various  SPIRE 
documentation  (7),  (9),  (15),  and  (16). 

SPIRE takes full advantage of the graphical capabilities of
the Symbolics Lisp Machine, which provides a bit-mapped display that is
either 1280 pixels wide by 760 pixels high or 1216 pixels wide by 773
pixels high, depending on the model. The following figures illustrate a
small sample of these capabilities for the utterance, "This is the CBS
Evening News."




Figure 1.1.  Figure 1.1 shows two repetitions of the
orthographic transcription and original waveform of the utterance. Note
that the scales of the two displays are different to allow closer
examination of waveform details.

Figure 1.2.  The second figure shows four displays of the same
utterance: Orthographic Transcription, Wide-Band Spectrogram, Formants,
and Original Waveform. Two of the displays are overlaid: the Wide-Band
Spectrogram and the Formants. Such overlays can make it easier to track
similarities among various representations of the data.

Figure 1.3.  The third figure illustrates another important
feature of SPIRE: its ability to synchronize displays. For example, in
the top display, there is a "cursor" located at 1.8251 seconds of the
Original Waveform. The cursor is automatically placed at the same point
in the next display, the Narrow-Band Spectrogram. The next display shows
the Narrow-Band Spectral Slice at that cursor position.

Figure 1.4.  The fourth figure identifies typical display types
available from SPIRE.


[SPIRE display figures for the utterance NEWSCAST (panels: Orthographic
Transcription, Original Waveform; time axis 0.0000 to 2.6000 seconds);
images not reproduced.]


Figure 1.2  Overlaid Displays


Energy,  Total 
Energy,  0  to  5000  Hz 
Energy,  120  to  440  Hz 
Energy,  3400  to  5000  Hz 
Energy,  640  to  2800  Hz 
Formants,  All  Four 
Formant,  First 
Formant,  Second 
Formant,  Third 
Formant,  Fourth 
Frication  Frequency 
LPC  Center  of  Gravity 
LPC  Gain  Term 

LPC  Predictor  Coefficients 
LPC  Spectrum  Slice 
Narrow-Band  Spectrogram 
Narrow-Band  Spectral  Slice 
Narrow-Band  Spectrum  Slice 
Original  Waveform 
Orthographic  Transcription 
Phonetic  Transcription 
Pitch  Frequency 
Waveform  Envelope 
Wide-Band  Spectrogram 
Wide-Band  Spectral  Slice 
Wide-Band  Spectrum  Slice 
Zero Crossing Rate


Figure  1.4  Standard  SPIRE  Displays 


Interfacing  SPIRE  from  Lisp.  Behind  each  SPIRE  display  are  the 
underlying  computations  required  for  computing  that  display,  for  example 
a  Fourier  transform.  These  underlying  processes  are  available  through 
Lisp  as  simple  function  calls.  Figure  1.5  describes  the  primary 
functions  used  to  make  SPIRE  perform  computations  on  an  utterance. 

The  three  functions  of  figure  1.5  can  be  combined  into  a  single 
Lisp  expression.  For  example, 

     (SETQ RESULT-ARRAY
           (SPIRE:ATT-VAL (SEND (SPIRE:UTTERANCE PATHNAME)
                                :FIND-ATT ATT-NAME)))

where  RESULT-ARRAY  is  the  variable  containing  the  result  of  the 
computation  defined  by  the  variable  ATT-NAME  performed  on  the 
utterance  defined  by  the  variable  PATHNAME. 

When  no  more  computations  are  necessary  on  a  particular  utterance, 
the  utterance  may  be  "killed"  or  "unloaded"  as  follows: 

     (SEND (SPIRE:UTTERANCE PATHNAME) :KILL)

where  the  variable  PATHNAME  describes  the  utterance  to  be  killed. 

Note  that  the  method  used  here  does  not  alter  any  of  SPIRE's 
default  attributes.  Appendix  A  shows  a  list  of  SPIRE’s  attribute 
defaults. 

When  SPIRE  is  called  to  perform  a  computation  on  an  utterance,  the 
result  is  returned  in  the  form  of  an  array,  the  dimensions  of  which 
depend  on  the  type  of  computation.  Figure  1.6  lists  the  array  types 
returned  for  various  SPIRE  function  calls. 


SPIRE:UTTERANCE

Parameters:  pathname (required)
Type:  function
Returns:  utterance-flavor

Description:  The utterance in the file "pathname" becomes the current
utterance in SPIRE. If needed, the utterance is loaded
into memory from disk. This function must be called
before any computation can take place.


:FIND-ATT

Parameters:  att-name (required)
Type:  message to utterance flavor
Returns:  att

Description:  att-name is a string that identifies what attribute the
user desires SPIRE to compute. For example, assume we
are to compute the Wide-Band Spectrum of an utterance
stored in the file ">DAWSON>UTTS>ZERO.UTT". First,
select the utterance:

     (SETQ TEMP1
           (SPIRE:UTTERANCE ">DAWSON>UTTS>ZERO.UTT"))

TEMP1 stores the utterance flavor for the next step:

     (SETQ TEMP2
           (SEND TEMP1 :FIND-ATT "WIDE-BAND SPECTRUM"))

TEMP2 now holds the att from which the actual values
may be extracted (see next function).


SPIRE:ATT-VAL

Parameters:  att (required)
Type:  function
Returns:  array (results of computation)

Description:  This function returns the computed value of the att we
are interested in. For example, if TEMP2 holds the
att (as discussed above), extract the values:

     (SETQ TEMP3 (SPIRE:ATT-VAL TEMP2))

TEMP3 now holds the "Wide-Band Spectrum" values.

Similar procedures are followed for obtaining the values
of any of the standard SPIRE computations.


Figure  1.5  SPIRE  Interface  Functions 




Attribute Name            Result Array
------------------------  ------------
Wide-Band Spectrum        2-D, 256 X N
Narrow-Band Spectrum      2-D, 256 X N
LPC Spectrum              2-D, 256 X N
LPC Coefficients          2-D, 19 X N
Formants (four)           2-D, 5 X N
LPC Gain Term             1-D, N
LPC Center of Gravity     1-D, N
Zero Crossing Rate        1-D, N
Frication Frequency       1-D, N
Total Energy              1-D, N

N = time * analysis rate

Figure 1.6 SPIRE Result Arrays


Hardware (13,15:16)

SPIRE is a software package that requires specific hardware to
run. The hardware options are briefly described below.

Lisp  Machine.  (Required)  SPIRE  is  designed  to  run  on  a  Symbolics 
3600  Series  Lisp  Machine.  The  Symbolics  Lisp  Machine  is  a  powerful 
computer  specifically  designed  to  efficiently  run  Lisp  code.  It 
provides  an  extremely  efficient  user  interface  with  extensive  graphics 
capabilities.  Also  available  from  Symbolics  is  a  Floating  Point 
Accelerator  (FPA)  card  designed  to  speed  up  floating  point  operations  by 
about  a  factor  of  three.  The  FPA  is  an  add-on  card  that  is  generally 
invisible  to  the  application  software  such  as  SPIRE. 

Array Processor. Certain versions of SPIRE are designed to support
a Floating Point Systems FPS 100 (or FPS 200) array processor. An array
processor is a special purpose device designed to quickly handle
computations on large arrays of data. The FPS 100 is connected to the
Symbolics Lisp machine through a UNIBUS interface. The array processor
can drastically reduce the computation time required for certain SPIRE
functions. An approximate comparison between a "bare" Symbolics Lisp
Machine, one with an FPA, and one with an FPS 100 is given in the
following table.


Configuration               Ratio  Example
--------------------------  -----  ------------
FPS Array Processor         10     1.0 minutes
Floating Point Accelerator  3      3.0 minutes
Bare Lisp Machine           1      10.0 minutes

Speech digitizer. SPIRE is designed to operate with a Digital
Sound Corporation (DSC) analog-to-digital converter. The DSC is
connected to the Symbolics Lisp machine via the UNIBUS interface. The
DSC is used primarily to digitize speech and other audio signals. The
audio input can be direct or prerecorded and fed through line-in jacks.
The DSC can also be used for high quality playback of the digitized
signals.

Summary  of  Equipment 

The Symbolics Lisp Machine actually used for this speech
recognition research was an older model Symbolics 3600 with one
mega-word of RAM operating under version 6.0 of the operating system.
The Lisp Machine was equipped with a Floating Point Accelerator to
reduce computation time. An FPS 100 array processor was not connected.
Speech samples were digitized using a noise reducing microphone fed
directly into a DSC A/D converter. Version 17.2 of SPIRE was used.


III.  System  Design 


Introduction 

The  purpose  of  this  chapter  is  to  describe  the  system  design.  This 
chapter  will  provide  details  about  the  major  processing  functions  and 
how  they  are  used.  It  will  also  describe  generally  how  major  groups  of 
data are handled. Finally, a description of the dynamic programming
algorithm used is given.

Utterance  Processing 

As  mentioned  earlier,  the  continuous  speech  recognition  system  is 
designed  around  ZetaLisp  and  SPIRE.  SPIRE  is  used  as  a  function  library 
that  is  called  by  the  main  Lisp  routines.  A  discussion  of  how  this  is 
done  is  given  in  chapter  two.  Processing  of  an  utterance  consists  of 
specific  computations  done  by  SPIRE  on  the  original  digitized  waveform, 
plus  any  additional  processing  done  by  the  main  Lisp  routines.  Several 
Lisp  functions  are  defined  for  this  purpose  depending  on  the  desired 
features.  (See  Appendix  B). 

Feature  Extraction.  Feature  extraction  consists  of  calling  SPIRE, 
with  the  filename  of  the  utterance  and  the  name  of  the  feature,  to 
perform  the  necessary  computations  and  thus  return  the  desired  feature. 
This  is  done  by  a  function  called  COMPUTE-ATT.  (See  Appendix  B).  A 
discussion  of  methods  used  by  SPIRE  to  compute  the  desired  features 
follows. 

Wide-Band  and  Narrow-Band  Spectrum.  Spectrum  calculations  are 
returned  by  SPIRE  as  two  dimensional  arrays,  256  X  N,  where  N  is 
proportional  to  the  length  of  the  utterance.  In  both  cases,  the 



wide-band  spectrum  and  the  narrow-band  spectrum,  the  original  waveform 
is  pre-emphasized  and  then  run  through  a  256  point  Fast  Fourier 
Transform  (FFT)  routine  incorporating  a  Hamming  window.  The  wide-band 
spectrum  is  calculated  using  a  filter  bandwidth  of  300.0  Hz,  while  the 
narrow-band  spectrum  uses  a  filter  bandwidth  of  78.0  Hz.  Accordingly, 
the  narrow-band  spectrum  provides  more  frequency  resolution  than  does 
the  wide-band  spectrum.  The  results  are  returned  in  256  discrete 
frequency  components  representing  0  to  8000  Hz  in  log-magnitude  form. 
Figure  3.1  shows  an  example  of  wide-band  and  narrow-band  spectral 
slices. 

LPC Spectrum. The LPC spectrum result is similar to the wide-band
spectrum above, except that the LPC coefficients are used to calculate
the spectrum. The LPC spectrum generally resembles a smoothed version
of the wide-band spectrum. Figure 3.1 shows an example of an LPC
spectrum slice as well as wide-band and narrow-band spectral slices.

Formants. Formants are returned by SPIRE as a two dimensional
array, 5 X N, where N is proportional to the length of the utterance.
Rows one through four of this array represent the first four formant
frequencies, respectively. Row zero is not used. Formant values are
computed from the LPC Spectrum. The formant peaks are found by fitting
a polynomial to each LPC spectral slice. The polynomial is then
differentiated and solved for zeros. Formant tracks are somewhat
erratic. The formant tracking algorithm usually loses track during
fricative sounds. Figure 3.2 shows an example of formants along with
the original waveform and frication frequency.

Frication Frequency. Frication frequency is returned as a one
dimensional array of length N, where N is proportional to the length of
the utterance. It attempts to track the frequency of fricative sounds
in an utterance. During non-fricative sounds the value is below 500,
and during fricative sounds the values are above 1000. Frication
frequency is a fair indicator of whether a fricative or vowel sound is
occurring. Figure 3.2 shows an example of Frication Frequency.

[Figure 3.2 Formants, Frication Frequency. Panels: Original Waveform,
Frication Frequency; utterance 4331479.]

Additional Processing. It is necessary to perform additional
processing on SPIRE results. This additional processing is discussed
below.
Clipping. The last five time slices of all SPIRE results are
clipped or ignored. Due to the predictive nature of the LPC
coefficient computations, the last five time slices cannot be calculated
and are returned by SPIRE as zero values. As a result, any feature
which is built upon LPC coefficients, such as formants, also has
zero values in the last five time slices. To maintain uniformity, that
is, so that any feature extracted from a given utterance will have the
same meaningful length, the last five time slices are ignored for all
SPIRE results.

Median Filtering. Due to the erratic nature of the formant
tracks, these results are further processed through a median filter.
The median filter filters out unwanted spikes in the formant tracks.
(See Appendix B for the Lisp function MEDIAN-FILTER.)
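The effect of median filtering on a formant track can be illustrated with a short Python sketch. This is not the MEDIAN-FILTER function of Appendix B; the window width of five samples and the shrinking window at the ends of the track are assumptions made only for illustration, and the track values are invented.

```python
from statistics import median


def median_filter(track, width=5):
    """Replace each sample by the median of a window centred on it.

    `width` is a hypothetical choice; the window shrinks at both ends
    of the track so that the output has the same length as the input.
    """
    half = width // 2
    smoothed = []
    for t in range(len(track)):
        lo = max(0, t - half)
        hi = min(len(track), t + half + 1)
        smoothed.append(median(track[lo:hi]))
    return smoothed


# Invented first-formant track (Hz) with two tracker glitches.
f1_track = [500, 510, 2900, 520, 515, 505, 0, 510]
f1_smooth = median_filter(f1_track)
```

Isolated spikes such as the 2900 and 0 values are removed, while the slowly varying part of the track is left nearly unchanged.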

Frequency Compression. To reduce computation requirements,
wide-band, narrow-band, and LPC spectrum results are compressed from 256
discrete frequency components down to 16. Further, this compression is
done so as to emphasize resolution in the lower frequencies and
de-emphasize resolution of the higher frequencies. Briefly, the lower
132 frequency components (0 to 4,125 Hz) are linearly compressed down
to 12 components, and the upper 124 components (4,125 to 8,000 Hz) are
linearly compressed down to 4 components. It should be noted here that
since the speech waveforms are pre-emphasized by SPIRE before performing
spectrum calculations, frequency components are averaged instead of
added. (See Appendix B for the Lisp function FREQUENCY-COMPRESS-LFE.)
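The compression described above can be sketched in Python as follows. This is not the FREQUENCY-COMPRESS-LFE function of Appendix B; it assumes that "linearly compressed" means averaging equal-sized runs of adjacent components (11 per output component in the lower band, 31 in the upper band), and the function names are hypothetical.

```python
def average_groups(values, n_groups):
    """Split `values` into `n_groups` equal runs and average each run."""
    size, remainder = divmod(len(values), n_groups)
    if remainder:
        raise ValueError("values do not divide evenly into groups")
    return [sum(values[g * size:(g + 1) * size]) / size
            for g in range(n_groups)]


def compress_slice(spectrum, split=132, low_bins=12, high_bins=4):
    """Compress one 256-component spectral slice down to 16 components.

    The lower `split` components become `low_bins` averages and the
    remaining components become `high_bins` averages.
    """
    low, high = spectrum[:split], spectrum[split:]
    return average_groups(low, low_bins) + average_groups(high, high_bins)


# A ramp makes the averages easy to check by hand.
ramp = [float(n) for n in range(256)]
compressed = compress_slice(ramp)
```

With 256 components spanning 0 to 8,000 Hz, each component covers 31.25 Hz, so the split at component 132 corresponds to the 4,125 Hz boundary given in the text.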

Energy Normalization. Energy normalization is performed on
each time slice of wide-band, narrow-band, and LPC spectrum. This is
done so that energy disparities do not affect the word recognition
process. (See Appendix B for the Lisp function ENERGY-NORMALIZE.)

Ready-Utterance. A ready-utterance is simply a name used to
represent the set of data which is the result of all the processing done
on a given utterance. Once computed, a ready-utterance is stored to
disk so that it may be used over and over without having to re-compute
all its features. A ready-utterance takes the Lisp form of a list of
arrays, where each array corresponds to a processed SPIRE result.

Ready-Template. A ready-template is simply a name used to
represent the processed version of the entire recognition vocabulary.
Each utterance of the recognition vocabulary is processed into
individual ready-utterances and combined into one large list of
ready-utterances. Again, this is so that re-computation is reduced.


Dynamic Time Warping Algorithm (10)

Introduction. Dynamic time warping or dynamic programming is a
method by which speech patterns are nonlinearly time aligned. This time
alignment is necessary due to the nonlinear time variations common in
speech. Dynamic time warping was invented by T. K. Vintsyuk. The
algorithm used for this system is one adapted for continuous speech. It
was originally presented by Vintsyuk and later translated by Hermann Ney
(10:263). (See Appendix B for the Lisp function SCAN-DTW.)

Distance Arrays. A distance array is basically a two dimensional
array, M by N, where M is proportional to the length of the template,
and N is proportional to the length of the utterance or test pattern.
(Preliminarily, assume isolated speech.) The template and the
utterance are represented by sequences of M and N vectors,
respectively. Each vector holds the features extracted from the
template at moment m or from the utterance at moment n. The value at
subscript (m, n) of the distance array then represents the vector
distance between the template at moment m and the utterance at
moment n.

Distance  arrays  are  a  key  element  in  the  word  recognition  process. 
For  isolated  speech,  a  measure  of  utterance-template  similarity  is  taken 
by  tracing  the  path  from  point  (0,  0)  to  point  (M,  N)  of  the  distance 
array  that  results  in  the  smallest  accumulated  distance  of  all  the 
points  in  that  path. 

Figure 3.3 shows a simplified example of a distance array using a
hypothetical feature set consisting of energy in three frequency bands.
That is, at any particular moment, the speech is represented by a
3-dimensional vector representing the energy in each of the three
frequency bands. For isolated speech, the correct word would be
identified by calculating a distance array between the test word and
each word in the recognition vocabulary and then choosing the
template that results in the lowest accumulated distance. The distance
rule used is Minkowski 1 distance, also known as the taxi distance. For
example, the distance between the vectors <0,2,5> and <5,5,0> would be
5 + 3 + 5 = 13. In order to find the minimum path through the connected
speech distance array, a new "accumulated distance array" is
constructed, shown in figure 3.4. In this array the value of each point
represents an accumulated distance that is equal to the local distance
of that point plus the minimum of the accumulated distances of all
possible preceding points. Notice that the problem for isolated speech
is simplified by the fact that the begin and end points are known. Also
notice that certain constraints govern the route of the traced path.
The path must continue forward in time for both the template and the
test pattern. Therefore the path cannot go left or down in direction,
and points may not be skipped or hopped over.
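The isolated-word procedure described above can be sketched in Python. This is an illustration only, not the thesis code: the function names, the two vocabulary words, and their feature vectors are invented, and the three allowed predecessors (from below, from the left, and diagonally) are an assumption consistent with the path constraints stated in the text.

```python
def taxi_distance(u, v):
    """Minkowski 1 (taxi) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def distance_array(template, test):
    """Local distances d[m][n] between template frame m and test frame n."""
    return [[taxi_distance(t, x) for x in test] for t in template]


def accumulated_distance(d):
    """Accumulated distances: local distance plus the cheapest predecessor."""
    rows, cols = len(d), len(d[0])
    acc = [[0.0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            predecessors = []
            if m > 0:
                predecessors.append(acc[m - 1][n])
            if n > 0:
                predecessors.append(acc[m][n - 1])
            if m > 0 and n > 0:
                predecessors.append(acc[m - 1][n - 1])
            acc[m][n] = d[m][n] + (min(predecessors) if predecessors else 0.0)
    return acc


def recognize_isolated(vocabulary, test):
    """Pick the word whose template gives the lowest total distance."""
    def total(word):
        d = distance_array(vocabulary[word], test)
        return accumulated_distance(d)[-1][-1]
    return min(vocabulary, key=total)


# Hypothetical two-word vocabulary with energy in three frequency bands.
vocabulary = {
    "low": [(0, 2, 5), (0, 2, 5)],
    "rise": [(0, 2, 5), (5, 5, 0)],
}
test_word = [(0, 2, 5), (1, 2, 5), (5, 4, 0)]
best_word = recognize_isolated(vocabulary, test_word)
```

The total for each word is read from the last corner of its accumulated distance array, which is possible here only because the begin and end points of an isolated word are known.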

One-Stage Algorithm for Connected Speech. What follows is a brief
summary of the algorithm given by Ney [10], which the reader should
consult for further details. A composite distance array of grid points
(i,j,k) is computed as shown in figure 3.6. Individual time slices of
the test pattern are referenced by index i. Individual time slices for
each template k are referenced by index j. In order to find the minimum
path through the composite array, a minimum accumulated distance
D(i,j,k) is defined for each grid point (i,j,k). Each point D(i,j,k) is
the minimum sum of local distances d(i,j,k) along some path to grid
point (i,j,k). For any grid point (i,j,k), D(i,j,k) is found by
selecting the predecessor with the minimum accumulated distance and
adding that accumulated distance to the local distance d(i,j,k). The
transition rules consist of within-template rules and between-template
rules. Thus for the template interior, j > 1, the recursion rule is,

$$D(i,j,k) = d(i,j,k) + \min\left[D(i-1,j,k),\; D(i-1,j-1,k),\; D(i,j-1,k)\right] \qquad (1)$$

At template boundaries with j = 1, the recursion rule is,

$$D(i,1,k) = d(i,1,k) + \min_{k^*}\left[D(i-1,J(k^*),k^*)\right] \qquad (2)$$

where k* = 1,...,K and J(k*) is the ending frame of template k*.
Figure 3.7 depicts within-template and between-template transition
rules for connected speech distance arrays. By keeping track of where
the path crosses template boundaries, the problem of boundary detection
in the test pattern is handled automatically.

Time Distortion Penalties. Ideally the total accumulated distance
through the distance array should be independent of the slope of the
path in order to allow all types of time axis distortion. Therefore,
the algorithm applies time distortion penalties using slope dependent
weights. Depending on the three directions, horizontal, diagonal, and
vertical, the local distance is multiplied by the weights (1 + a), 1,
and b prior to evaluating the dynamic programming recursion:

$$
\begin{aligned}
D(i,j,k) = \min[\; & (1+a)\,d(i,j,k) + D(i-1,j,k), \\
                   & d(i,j,k) + D(i-1,j-1,k), \\
                   & b\,d(i,j-1,k) + D(i,j-1,k) \;] \qquad (3)
\end{aligned}
$$

(In the actual algorithm, this recursive formula is not actually
implemented recursively but forwardly as the accumulated distance array
is computed.) The number of local distances per input frame is thus 1 +
(a/2) for slope 1/2, 1 for slope 1, and 1 + b for slope 2. Figure 3.5
depicts the time distortion penalties. Weights of a = 1 and b = 1/2 are
typically used.


Summary of Steps. A summary of the connected speech algorithm is
given as follows.

Step 1) Initialize $D(1,j,k) = \sum_{n=1}^{j} d(1,n,k)$.

Step 2)
    a) For i = 2,...,N, do steps 2b-2e.
    b) For k = 1,...,K, do steps 2c-2e.
    c) $D(i,1,k) = d(i,1,k) + \min_{k^*}\left[D(i-1,J(k^*),k^*)\right]$.
    d) For j = 2,...,J(k), do step 2e.
    e) $D(i,j,k) = \min\left[(1+a)\,d(i,j,k) + D(i-1,j,k),\; d(i,j,k) + D(i-1,j-1,k),\; b\,d(i,j-1,k) + D(i,j-1,k)\right]$.

Step 3) Trace back the path from the grid point at a template
        ending frame with the minimum total distance using array
        D(i,j,k) of accumulated distances.

The unknown sequence is recovered in step 3) above by tracing back the
decisions taken by the "minimum" operator at each grid point. (10:265)
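Steps 1 and 2 above, together with the choice of the ending template in step 3, can be sketched in Python. This is not the SCAN-DTW function of Appendix B. The function names and the scalar example features are invented; indices are zero-based; and the vertical term uses the local distance of the grid point below, d(i,j-1,k), exactly as step 2e is written. The full array D(i,j,k) is kept here for clarity.

```python
def accumulate(d, a=1.0, b=0.5):
    """Accumulated distances D[k][i][j] from local distances d[k][i][j].

    k indexes templates, i test frames, j template frames.
    """
    n_templates, n_frames = len(d), len(d[0])
    D = [[[0.0] * len(d[k][i]) for i in range(n_frames)]
         for k in range(n_templates)]

    # Step 1: first test frame, running sum up each template.
    for k in range(n_templates):
        running = 0.0
        for j, local in enumerate(d[k][0]):
            running += local
            D[k][0][j] = running

    # Step 2a: loop over the remaining test frames.
    for i in range(1, n_frames):
        # Best accumulated distance at any template ending frame, frame i-1.
        best_end = min(D[m][i - 1][-1] for m in range(n_templates))
        for k in range(n_templates):                      # step 2b
            D[k][i][0] = d[k][i][0] + best_end            # step 2c
            for j in range(1, len(d[k][i])):              # step 2d
                D[k][i][j] = min(                         # step 2e
                    (1 + a) * d[k][i][j] + D[k][i - 1][j],
                    d[k][i][j] + D[k][i - 1][j - 1],
                    b * d[k][i][j - 1] + D[k][i][j - 1],
                )
    return D


def ending_template(D):
    """Step 3 starting point: template with the lowest final distance."""
    return min(range(len(D)), key=lambda k: D[k][-1][-1])


# Invented scalar features: two 2-frame templates and a 4-frame test.
templates = [[0, 1], [5, 6]]
test_pattern = [0, 1, 5, 6]
local = [[[abs(x - f) for f in tmpl] for x in test_pattern]
         for tmpl in templates]
D = accumulate(local)
last_word = ending_template(D)
```

Recovering the complete word sequence additionally requires a record of where the path crossed template boundaries, which this sketch does not keep.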


[Figure 3.5 Time Distortion Penalties. Axes: time frame i of input
pattern (horizontal), time frame of template (vertical).]


Storage. In reality, the whole accumulated array need not be
computed and stored at once. To perform the dynamic programming
recursions from a time frame i, only a small portion of the complete
array D(i,j,k) of accumulated distances is needed. Thus using only one
column of storage, D(j,k), the recursions (1) and (3) are carried out by
proceeding along the time axis of the test pattern and updating the
storage column point by point. Using this method causes the details of
the path to be lost, and backtracking information (boundary crossings)
must be stored along the way. Two 1-dimensional arrays, length N, are
used for this purpose. The words and boundaries are finally found by
tracing back through the 1-dimensional arrays from end point to begin
point, etc., until the beginning of the test pattern is reached. Figure
3.8 depicts the idea of backpointers for individual grid points while
figure 3.9 depicts the traceback procedure. A flow diagram for the
One-Stage Dynamic Time Warping Algorithm for Connected Speech is shown
in figure 3.10. For more details refer to Ney [10] or Appendix B for
the Lisp function SCAN-DTW.
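The column storage and the two 1-dimensional backtracking arrays can be sketched in Python. This is not the SCAN-DTW function of Appendix B. All names are hypothetical, the example features are invented, and for readability the sketch keeps the previous and the new column for each template rather than updating a single column in place. A backpointer here holds the index of the last test frame of the preceding word, with -1 marking the beginning of the test pattern.

```python
def taxi_distance(u, v):
    """Minkowski 1 (taxi) distance between two feature vectors."""
    return sum(abs(p - q) for p, q in zip(u, v))


def one_stage_recognize(templates, test, a=1.0, b=0.5):
    """Return the recognized sequence of template indices for `test`."""
    n_frames = len(test)
    D = [[0.0] * len(t) for t in templates]  # column of accumulated distances
    B = [[-1] * len(t) for t in templates]   # column of backpointers
    from_template = [0] * n_frames           # best ending template per frame
    from_frame = [-1] * n_frames             # its backpointer per frame
    best_end = 0.0                           # best ending distance, last frame

    for i, frame in enumerate(test):
        for k, template in enumerate(templates):
            local = [taxi_distance(frame, f) for f in template]
            old_d, old_b = D[k], B[k]
            new_d = [0.0] * len(template)
            new_b = [-1] * len(template)
            if i == 0:
                running = 0.0                # initialization (step 1)
                for j, dist in enumerate(local):
                    running += dist
                    new_d[j] = running
            else:
                new_d[0] = local[0] + best_end   # between-template rule
                new_b[0] = i - 1
                for j in range(1, len(template)):  # within-template rule
                    candidates = [
                        ((1 + a) * local[j] + old_d[j], old_b[j]),
                        (local[j] + old_d[j - 1], old_b[j - 1]),
                        (b * local[j - 1] + new_d[j - 1], new_b[j - 1]),
                    ]
                    new_d[j], new_b[j] = min(candidates, key=lambda c: c[0])
            D[k], B[k] = new_d, new_b
        winner = min(range(len(templates)), key=lambda k: D[k][-1])
        from_template[i] = winner
        from_frame[i] = B[winner][-1]
        best_end = D[winner][-1]

    # Trace back from the last test frame to the beginning.
    words, i = [], n_frames - 1
    while i >= 0:
        words.append(from_template[i])
        i = from_frame[i]
    return words[::-1]


# Invented one-component features: word 0 = (0, 1), word 1 = (5, 6).
word_templates = [[(0,), (1,)], [(5,), (6,)]]
spoken = [(0,), (1,), (5,), (6,), (0,), (1,)]
recognized = one_stage_recognize(word_templates, spoken)
```

Because a new word can only start on the frame after some template has ended, every backpointer refers to an earlier frame, so the traceback loop always terminates.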


[Figure 3.8 Backpointers from three preceding grid points (i,j,k) to
their starting frames. Axes: time frame i of input pattern, time frame
of template.]

[Figure 3.9 The Backtracking Procedure. Rows: backpointer, minimum
template; axis: time frames i of input pattern.]


INITIALIZE ARRAYS OF ACCUMULATED DISTANCES AND BACKPOINTERS.

LOOP OVER TIME FRAMES OF THE INPUT PATTERN.

    LOOP OVER TEMPLATES.

        EVALUATE DYNAMIC PROGRAMMING RECURSION ACCORDING
        TO BETWEEN-TEMPLATE RULES.
        - UPDATE THE COLUMN ARRAY OF ACCUMULATED DISTANCES.
        - UPDATE THE COLUMN ARRAY OF BACKPOINTERS.

        LOOP OVER TIME FRAMES OF THE TEMPLATES.

            EVALUATE DYNAMIC PROGRAMMING RECURSION ACCORDING
            TO WITHIN-TEMPLATE RULES.
            - UPDATE THE COLUMN ARRAY OF ACCUMULATED DISTANCES.
            - UPDATE THE COLUMN ARRAY OF BACKPOINTERS.

        LOOP CONTROL

    LOOP CONTROL

    KEEP TRACK OF THE TEMPLATE WITH MINIMUM ACCUMULATED DISTANCE
    AT ITS ENDING FRAME IN A "FROM TEMPLATE" ARRAY.

    KEEP TRACK OF BACKPOINTERS AT THE ENDING FRAME OF THE
    CORRESPONDING TEMPLATE IN A "FROM FRAME" ARRAY.

LOOP CONTROL

RECOVER THE SEQUENCE OF TEMPLATES:

    - START FROM THE TEMPLATE WITH THE MINIMUM ACCUMULATED
      DISTANCE AT ITS ENDING FRAME.

    - BACKTRACK THE SEQUENCE OF TEMPLATES USING THE "FROM FRAME"
      AND "FROM TEMPLATE" ARRAYS UP TO THE BEGINNING FRAME OF
      THE INPUT PATTERN.


IV.  Results  and  Discussion 


Introduction 

The  purpose  of  this  chapter  is  to  present  the  results  of  an 
investigation  into  the  applicability  of  various  feature  sets  to 
connected-speech  recognition.  The  chapter  begins  by  examining  features 
one  by  one  and  observing  their  distance  array  contours.  Distance 
distributions  for  the  distance  arrays  are  given  in  the  form  of 
histograms.  Finally,  individual  features  and  combinations  of  features 
are  tested  at  connected-speech  recognition  using  the  one-stage  dynamic 
time  warping  algorithm  for  connected  speech  of  chapter  three. 

The  Distance  Array  Contour 

The  distance  array  contour  is  a  useful  way  of  observing  a  feature 
set's  applicability  to  speech  recognition.  The  distance  array  contour 
is  made  by  calculating  an  array  of  distances  between  a  single  known 
template  word  and  a  single  known  utterance.  Distances  calculated  are 
Minkowski  1  distances  (taxi  distance).  Then  the  distance  array  is 
plotted  with  distances  below  a  certain  threshold  represented  by  black, 
and  distances  above  the  threshold  represented  by  white.  The  threshold 
is  determined  by  trial  and  error  until  about  half  the  area  is  dark  and 
half  is  white.  This  threshold  varies  for  different  feature  sets. 

Figures 4.1 through 4.7 show distance array contours using the template
word "three" and the utterance "4-3-3-1-4-7-9", along with corresponding
distribution histograms.
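The thresholding step behind a distance array contour can be sketched in Python. In the thesis the threshold is found by trial and error; the sketch below substitutes the median of all distances, which is an assumption that gives roughly half dark and half white points directly. The function name and the small distance array are invented.

```python
from statistics import median


def contour(distances):
    """Return (threshold, rows) for a two-level distance array plot.

    Points below the threshold are drawn dark ('#'), all others
    white ('.'). The median is used as the threshold in place of the
    trial-and-error search described in the text.
    """
    flat = [value for row in distances for value in row]
    threshold = median(flat)
    rows = ["".join("#" if value < threshold else "." for value in row)
            for row in distances]
    return threshold, rows


# Invented 3 x 3 distance array with a low-distance diagonal.
example = [[1, 9, 8],
           [7, 2, 9],
           [8, 9, 3]]
threshold, rows = contour(example)
```

A matching word shows up as a dark diagonal run, as along the main diagonal of this small example.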

Wide-Band Spectrum. Figure 4.1 shows the distance array contour
using Wide-Band Spectrum as a feature set. Notice that where the word
"three" appears in the test pattern, there are diagonal dark patterns
extending from bottom to top of the distance array contour. Those dark
diagonal patterns represent the occurrences of "three" in the test
pattern matching up with the template version of "three". Even though
the original waveforms of the two occurrences of "three" are markedly
different, this feature is able to show agreement between each
occurrence and the template.

Narrow-Band Spectrum. Figure 4.2 shows the distance array contour
using narrow-band spectrum as a feature set. This contour looks very
similar to the one for wide-band spectrum except that the patterns are
more distinct. There is less "noise" in the narrow-band spectrum
representation.

LPC Coefficients. Figure 4.3 shows the distance array contour
using  LPC  Coefficients  as  a  feature  set.  In  this  feature  set,  the 
distances  are  calculated  from  the  actual  LPC  filter  coefficients.  In 
this  case  the  patterns  are  not  clear.  Using  this  feature  in  this  way 
performs  poorly  in  terms  of  showing  agreement  between  the  template  and 
the  test  pattern. 

LPC Spectrum. Figure 4.4 shows the distance array contour using
LPC Spectrum as a feature set. In this feature set, the distances are
calculated from the spectral components derived from the LPC filter
coefficients. In this contour, the two occurrences of the word "three"
appear even more clearly than with wide-band and narrow-band spectrum.

Formants. Figure 4.5 shows the distance array contour using
Formants as a feature set. This feature set works well during vowel
sounds, but works erratically during fricatives when the formant tracker
loses track. This contour shows that formants can show agreement
between vowel sounds but not fricatives.


Frication Frequency. Figure 4.6 shows the distance array contour
using Frication Frequency as a feature set. This contour fails to show
any agreement between the template and the input pattern.

Zero  Crossing  Rate.  Figure  4.7  shows  the  distance  array  contour 
using  Zero  Crossing  Rate  as  a  feature  set.  This  contour  also  fails  to 
show  any  agreement  between  the  template  and  the  input  pattern. 

Recognition of Connected Speech

In order to fully observe the feature sets' applicability to
connected-speech recognition, different features and combinations
thereof are tested using the one-stage dynamic time warping algorithm
for connected speech proposed by Ney (10). Appendix C contains sample
results of the recognition system using the various feature sets for
speaker-dependent continuous speech recognition. In these figures, the
template set is displayed vertically on the left and the test pattern
horizontally on the bottom. The composite distance array contour is
shown in the middle. The template word boundaries are marked by
horizontal lines. The vertical lines represent the word boundaries
within the test pattern as computed by the recognition system. Below
the test pattern waveform are the words as recognized by the system.

Speaker-Dependent Results. Feature sets are tested first for
speaker-dependent performance. In this case the template patterns and
the test patterns are made by the same speaker. The features tested
here are wide-band, narrow-band, and LPC spectrum, formants, and a
combination of formants, LPC spectrum, and frication frequency. Table
4.1 shows results for each of these feature sets.


[Figure 4.2 Distance Array Contour, Narrow-Band Spectrum]

[Figure 4.3 Distance Array Contour, LPC Coefficients (Threshold = 5)]

[Figure 4.4 Distance Array Contour, LPC Spectrum]

[Formants distance array contour. Axes: Template, Test Pattern.]

[Zero Crossing Rate distance array contour (Threshold = 1.0)]


Wide-Band Spectrum. The feature set consisting of only
wide-band spectrum performs only fairly, correctly recognizing 29 out of
38 digits, spoken in five to seven word utterances for the speaker
"RGD".

Narrow-Band Spectrum. This feature set performs about the
same as wide-band spectrum, also correctly recognizing 29 out of 38
digits, spoken in five to seven word utterances for the speaker "RGD".

Formants. The feature set consisting of only the first and
second formant frequencies performed surprisingly well, recognizing 32
out of 38 digits, spoken in five to seven word utterances for the
speaker "RGD". The good performance of this feature, considering how
the formant tracks are lost during fricatives, leads to the next
feature set, which is a combination of LPC Spectrum and Formants.

LPC Spectrum. The feature set consisting of only LPC
spectrum performs the best of the three spectrum features, correctly
recognizing 35 out of 38 digits, spoken in five to seven word
utterances for the speaker "RGD".

LPC Spectrum, Formants, Frication Frequency. This feature set
consists of a combination of LPC spectrum and formants. Frication
frequency is used as a "gate" to determine whether vowels or fricatives
are present. Zero crossing rate could also be used as a gate between
vowel and fricative sounds, because a rate between about 300 and 900
usually indicates a vowel sound. However, zero crossing rate goes to
zero during very low energy periods, as shown by figure 4.8.
Therefore, moments between silence and frication, as the zero crossing
rate rises from 0 to above 900, would be mistaken for vowel sounds.
Using frication frequency as a "gate" enables formant tracks to be used
while substituting LPC spectrum distances when the formant frequencies
are not valid. This feature set performed very well, correctly
recognizing all 38 of the digits spoken by "RGD". Even the troublesome
"two-eight-two-eight-two-eight" combination was correctly recognized.

[Figure 4.8 Frication Frequency vs. Zero Crossing Rate. Panels:
Frication Frequency, Original Waveform, Wide-Band Spectrum Slice;
utterance 282828.]
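The gating idea can be sketched in Python for a single pair of frames. This is an illustration under stated assumptions, not the thesis implementation: the frame layout and names are hypothetical, a frame is treated as a vowel when its frication frequency is below 500 (the non-fricative range given in chapter three), only the first two formants are compared, and no scaling between formant distances and spectrum distances is attempted because the text does not describe one.

```python
from typing import NamedTuple, Tuple

VOWEL_LIMIT = 500.0  # frication frequency below this is treated as a vowel


class Frame(NamedTuple):
    """One time slice of a hypothetical ready-utterance."""
    fric_freq: float
    formants: Tuple[float, float]      # (F1, F2) in Hz
    lpc_spectrum: Tuple[float, ...]    # compressed LPC spectrum


def taxi_distance(u, v):
    """Minkowski 1 (taxi) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def gated_distance(template, test, vowel_limit=VOWEL_LIMIT):
    """Formant distance when both frames look like vowels, else LPC spectrum."""
    both_vowels = (template.fric_freq < vowel_limit
                   and test.fric_freq < vowel_limit)
    if both_vowels:
        return taxi_distance(template.formants, test.formants)
    return taxi_distance(template.lpc_spectrum, test.lpc_spectrum)


# Invented frames: two vowel-like frames and one fricative-like frame.
vowel_a = Frame(200.0, (300.0, 2300.0), (1.0, 2.0))
vowel_b = Frame(350.0, (320.0, 2250.0), (1.5, 2.0))
fricative = Frame(1500.0, (0.0, 0.0), (4.0, 0.5))

vowel_pair = gated_distance(vowel_a, vowel_b)
mixed_pair = gated_distance(vowel_a, fricative)
```

Whenever either frame falls outside the vowel range, the formant values are ignored and the LPC spectrum distance is used instead.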


                                                           LPC Spectrum,
            Wide-Band   Narrow-Band              LPC       Formants,
Utterance   Spectrum    Spectrum     Formants    Spectrum  Fric. Freq.
----------  ----------  -----------  ----------  --------  -------------
4331479     7.0/7.0     7.0/7.0      7.0/7.0     7.0/7.0   7.0/7.0
282828      0.0/6.0     0.0/6.0      4.0/6.0     5.0/6.0   6.0/6.0
2468        3.0/4.0     3.0/4.0      4.0/4.0     4.0/4.0   4.0/4.0
28318       3.0/5.0     3.0/5.0      2.0/5.0     3.0/5.0   5.0/5.0
012345      6.0/6.0     6.0/6.0      6.0/6.0     6.0/6.0   6.0/6.0
56789       5.0/5.0     5.0/5.0      5.0/5.0     5.0/5.0   5.0/5.0
01379       5.0/5.0     5.0/5.0      4.0/5.0     5.0/5.0   5.0/5.0

Total       29.0/38.0   29.0/38.0    32.0/38.0   35.0/38.0 38.0/38.0
Percent     76%         76%          84%         92%       100%

Table 4.1 Speaker Dependent Feature Results


Speaker-Independent  Results.  Two  feature  sets  are  used  to  examine 
speaker-independent  connected  speech  recognition.  LPC  spectrum,  since 
it  is  so  commonly  used  in  practice,  is  used  as  a  baseline.  A  possibly 
improved  feature  set,  using  LPC  spectrum,  formants,  and  frication 
frequency  is  also  used.  Speaker-independent  performance  is  examined  by 
simply  trying  the  system  out  using  various  combinations  of  template  sets 
and  test  patterns,  of  course  each  by  different  speakers.  Finally, 
multiple  speaker  template  sets  are  tested. 


The improved feature set of LPC spectrum, frication frequency, and
formants is implemented differently than for the speaker-dependent
case. Formant frequencies are rather consistent for given vowel sounds
for a given speaker. However, formant frequencies of different speakers
uttering vowel sounds that are perceived as being the same can be quite
different. Figure 4.9 shows a plot of the first formant frequency (F1)
versus the second formant frequency (F2) for a population of speakers
uttering vowel sounds common to the English language. Those grouped
together were perceived as the same sound. It is clear from figure 4.9
that simple Minkowski 1 distances are insufficient, since points from
separate groups can have Minkowski 1 distances that are smaller than
those between points within the same group. Therefore, it is necessary
to combine these individual features differently than in the
speaker-dependent case.

First, a distance array is computed using only LPC spectrum. As in
the speaker-dependent case, frication frequency is used to locate valid
formant frequencies. Then, a point in the LPC spectrum distance array
is multiplied by 0.4, thereby emphasizing "agreement", if (1) that
point results from a valid vowel sound according to frication frequency
in both the template and the test pattern, and (2) the first and second
formants from both the template and the test pattern fall within the
same group. Figure 4.10 shows the groupings used by the algorithm for
each vowel sound.
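
The scaling rule just described can be sketched in a few lines. This is a minimal sketch, not the thesis code: the 1500 Hz frication limit and the convention that group 0 means "no vowel group" are taken from the NEW-READY-DTW-LPC-FORMANTS listing in the appendix, while the function and parameter names are hypothetical.

```python
FRICATION_LIMIT_HZ = 1500   # above this, a frame is not treated as a vowel
AGREEMENT_FACTOR = 0.4      # multiplier applied when both tests pass


def scale_distance(lpc_distance, fric_t, fric_u, group_t, group_u):
    """Return the LPC distance, reduced when template and utterance agree.

    fric_t, fric_u   -- frication frequency (Hz) of the template/utterance frame
    group_t, group_u -- vowel-group label of each frame (0 = no group)
    """
    both_vowels = fric_t <= FRICATION_LIMIT_HZ and fric_u <= FRICATION_LIMIT_HZ
    same_group = group_t != 0 and group_t == group_u
    if both_vowels and same_group:
        return AGREEMENT_FACTOR * lpc_distance
    return lpc_distance


print(scale_distance(10.0, 1000, 1200, 3, 3))   # both vowels, same group
print(scale_distance(10.0, 1600, 1200, 3, 3))   # template frame is not a vowel
```

Because the formants only ever reduce a distance, a wrong or missing group assignment leaves the plain LPC result untouched.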


[Figure 4.9: plot of vowel groupings; axes: frequency of F1 in Hz, frequency of F2 in Hz]

Figure 4.9  First Formant vs. Second Formant
Source (12:44)


Single-Speaker Template Sets. Table 4.2 shows results for
various template and speaker combinations. In every combination
tested, the addition of formant information improved recognition
accuracy.


Template   Speaker   LPC Spectrum   Plus Formants

JONES      RGD       29.5/38.0      34.5/38.0
JONES      SKIP      7.5/18.0       11.5/18.0
RGD        SKIP      15.5/18.0      18.0/18.0
RGD        JONES     26.0/33.0      31.0/33.0
SKIP       RGD       26.5/38.0      28.0/38.0
SKIP       JONES     25.0/33.0      31.0/33.0

TOTAL                130.0/178.0    154.0/178.0
PERCENT              73%            87%

Table  4.2  Single  Template  Speaker  Independent  Results 


Multi-Speaker  Template  Sets.  Table  4.3  shows  results  for 
various  multi-speaker  template  and  speaker  combinations.  Using 
multi-speaker  templates  further  improved  recognition  accuracy. 


Template        Speaker   LPC Spectrum   Plus Formants

SKIP & RGD      JONES     27.5/33.0      31.0/33.0
JONES & RGD     SKIP      12.5/18.0      18.0/18.0
SKIP & JONES    RGD       31.5/38.0      35.5/38.0

TOTAL                     71.5/89.0      84.5/89.0
PERCENT                   80%            95%

Table  4.3  Multi-Template  Speaker  Independent  Results 
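
The totals and percentages reported in Tables 4.2 and 4.3 can be re-derived from the per-pairing scores. The sketch below simply re-enters those scores and sums them; the dictionary layout and the name `summarize` are illustrative assumptions.

```python
# Each entry: (LPC spectrum score, LPC + formants score, maximum possible score)
single_speaker = {
    ("JONES", "RGD"): (29.5, 34.5, 38.0),
    ("JONES", "SKIP"): (7.5, 11.5, 18.0),
    ("RGD", "SKIP"): (15.5, 18.0, 18.0),
    ("RGD", "JONES"): (26.0, 31.0, 33.0),
    ("SKIP", "RGD"): (26.5, 28.0, 38.0),
    ("SKIP", "JONES"): (25.0, 31.0, 33.0),
}
multi_speaker = {
    ("SKIP & RGD", "JONES"): (27.5, 31.0, 33.0),
    ("JONES & RGD", "SKIP"): (12.5, 18.0, 18.0),
    ("SKIP & JONES", "RGD"): (31.5, 35.5, 38.0),
}


def summarize(rows):
    """Sum the scores over all (template, speaker) pairings."""
    lpc = sum(r[0] for r in rows.values())
    plus = sum(r[1] for r in rows.values())
    possible = sum(r[2] for r in rows.values())
    return (lpc, plus, possible,
            round(100 * lpc / possible), round(100 * plus / possible))


print("single-speaker templates:", summarize(single_speaker))
print("multi-speaker templates: ", summarize(multi_speaker))
```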


Overall  Results  Using  LPC  Spectrum.  Table  4.4  shows  the  overall 
results  including  both  single  and  multi-templates  using  only  LPC 
spectrum  as  a  feature  set. 


                          Template (LPC Spectrum)

Utterance       RGD        JONES      SKIP       SKIP & RGD   JONES & SKIP   JONES & RGD

RGD:
  012345        -          5.0/6.0    6.0/6.0    -            6.0/6.0        -
  4331479       -          7.0/7.0    5.0/7.0    -            7.0/7.0        -
  56789         -          2.5/5.0    1.5/5.0    -            2.5/5.0        -
  28318         -          2.0/5.0    3.0/5.0    -            2.0/5.0        -
  01379         -          3.0/5.0    5.0/5.0    -            5.0/5.0        -
  2468          -          4.0/4.0    3.0/4.0    -            3.0/4.0        -
  282828        -          6.0/6.0    3.0/6.0    -            6.0/6.0        -

JONES:
  4331479       6.0/7.0    -          7.0/7.0    7.0/7.0      -              -
  2555276       7.0/7.0    -          4.0/7.0    7.0/7.0      -              -
  28318         2.5/5.0    -          3.0/5.0    3.0/5.0      -              -
  2377097       3.5/7.0    -          4.0/7.0    3.5/7.0      -              -
  8351561       7.0/7.0    -          7.0/7.0    7.0/7.0      -              -

SKIP:
  1234          4.0/4.0    1.0/4.0    -          -            -              2.0/4.0
  1549768203    9.0/10.0   4.0/10.0   -          -            -              8.0/10.0
  2468          2.5/4.0    2.5/4.0    -          -            -              2.5/4.0

Table 4.4  Overall Results LPC Spectrum




Overall Results Using LPC Spectrum, Formants, and Frication
Frequency. Table 4.5 shows the overall results, including both single
and multi-templates, using LPC spectrum, formants, and frication
frequency combined as a feature set.


                          Template (LPC Spectrum + Formants)

Utterance       RGD        JONES      SKIP       SKIP & RGD   JONES & SKIP   JONES & RGD

RGD:
  012345        -          6.0/6.0    6.0/6.0    -            6.0/6.0        -
  4331479       -          7.0/7.0    5.5/7.0    -            7.0/7.0        -
  56789         -          3.5/5.0    2.5/5.0    -            3.5/5.0        -
  28318         -          5.0/5.0    1.0/5.0    -            5.0/5.0        -
  01379         -          4.0/5.0    5.0/5.0    -            5.0/5.0        -
  2468          -          3.0/4.0    3.0/4.0    -            3.0/4.0        -
  282828        -          6.0/6.0    5.0/6.0    -            6.0/6.0        -

JONES:
  4331479       7.0/7.0    -          7.0/7.0    7.0/7.0      -              -
  2555276       7.0/7.0    -          7.0/7.0    7.0/7.0      -              -
  28318         4.5/5.0    -          4.0/5.0    5.0/5.0      -              -
  2377097       5.5/7.0    -          6.0/7.0    6.0/7.0      -              -
  8351561       7.0/7.0    -          7.0/7.0    7.0/7.0      -              -

SKIP:
  1234          4.0/4.0    2.0/4.0    -          -            -              4.0/4.0
  1549768203    10.0/10.0  7.0/10.0   -          -            -              10.0/10.0
  2468          4.0/4.0    2.5/4.0    -          -            -              4.0/4.0

Table 4.5  Overall Results LPC Spectrum, Formants, Frication Frequency




V.  Conclusions  and  Recommendations 

Introduction 

The  purpose  of  this  chapter  is  to  discuss  conclusions  that  may  be 
drawn  based  on  the  performance  of  this  system  as  well  as  to  give 
recommendations  for  further  research  in  the  area  of  speaker-independent 
continuous-speech  recognition. 

Conclusions

This thesis is successful in producing a rather robust system for
continuous-speech recognition. It is shown here that Ney's algorithm
for connected speech works quite well. The idea of using template sets
made up of multiple speech features is also shown to be advantageous.
Results reveal that using formant information can significantly improve
recognition accuracy, especially in speaker-independent applications,
where overall accuracy rose from 73% to 87% with single-speaker
templates and from 80% to 95% with multi-speaker templates.

Recommendations

Environmental Stress. As described in chapter 3, this system was
tested with speech patterns virtually free of background noise. It
would be interesting to study its performance in the presence of
background noise, e.g., cockpit noise. The Armstrong Aerospace Medical
Research Laboratory at Wright-Patterson AFB has excellent facilities for
recording speech under noise conditions.

Tailored Template Sets. From the results of this system it still
isn't clear whether completely redundant template sets are necessary.
They seem to be useful for handling different pronunciations of certain
words, such as "eight" with or without the "t" sound at the end.
Unfortunately, redundant template sets carry a high price in terms of
computational intensity. A better approach may be to store a few
carefully selected template sets with only certain words redundant and
let the user select the one that works best. This would greatly
simplify the training process.

Additional  Features.  Although  the  system  was  able  to  discriminate 
between  different  vowel  sounds  well,  it  was  not  able  to  discriminate 
between  similar  fricative  sounds.  It  would  have  trouble  with  something 
like  "carp"  versus  "tarp".  A  logical  extension  would  be  to  add  ways  to 
discriminate  such  sounds. 

Syntactic  Rules.  Even  humans  have  trouble  identifying  spoken 
utterances  without  the  aid  of  syntax.  Ney  (10)  describes  methods  for 
adapting  the  algorithm  to  include  such  constraints.  Also,  currently, 
the  algorithm  will  apply  every  bit  of  the  test  pattern  to  some 
template.  It  has  no  way  of  handling  words  that  are  not  part  of  the 
vocabulary.  Syntactic  constraints  described  by  Ney  could  possibly  be 
adapted  to  handle  words  not  in  the  vocabulary. 

Dedicated Hardware. Although Ney's algorithm is very efficient,
dedicated hardware would be preferred for its interactive use. Hardware
to perform real-time LPC analysis is commonly available. The DoD
standard is known as LPC-10. The next step would be to implement the
dynamic time warping algorithm in hardware as well. Such a system
could then conceivably be operated in real time.
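
For orientation, the core recurrence that such hardware would have to evaluate can be sketched as follows. This is the textbook isolated-word form of dynamic time warping with three symmetric predecessor steps, not Ney's one-stage connected-word algorithm used by this system; the function name `dtw_cost` and the toy sequences are hypothetical.

```python
def dtw_cost(local):
    """Accumulated cost of the cheapest warping path through a
    local-distance array (rows = template frames, columns = test frames)."""
    rows, cols = len(local), len(local[0])
    inf = float("inf")
    acc = [[inf] * cols for _ in range(rows)]
    acc[0][0] = local[0][0]
    for m in range(rows):
        for n in range(cols):
            if m == 0 and n == 0:
                continue
            # Cheapest of the three allowed predecessors.
            best_prev = min(
                acc[m - 1][n] if m > 0 else inf,
                acc[m][n - 1] if n > 0 else inf,
                acc[m - 1][n - 1] if m > 0 and n > 0 else inf,
            )
            acc[m][n] = local[m][n] + best_prev
    return acc[-1][-1]


# Toy one-dimensional features; the local distance is abs(a - b).
template = [1, 3, 4]
utterance = [1, 1, 3, 4, 4]
local = [[abs(a - b) for b in utterance] for a in template]
print(dtw_cost(local))
```

Each cell depends only on its three neighbours, so the inner loop is a fixed add-and-compare pattern, which is the property that makes a hardware implementation plausible.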

Summary 

In summary, this thesis shows that additional speech features
(formants) can be successfully applied to the problem of
speaker-independent continuous speech recognition. Presumably, further
improvements could be made by carefully utilizing other features of
speech. Consequently, further research in this area could ultimately
help to solve the problem of speech recognition.




;;; -*- Mode: Lisp; Package: Spire; Base: 10 -*-
;;;
;;; SPIRE -- Speech and Phonetics Interactive Research Environment
;;;
;;; ATTRIBUTE-DEFAULTS
;;;
;;; (c) Copyright 1983, Massachusetts Institute of Technology, All Rights Reserved

(define-attribute "Zero Crossing Rate" zero-crossing-rate-flavor
  (sampled-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 200.
  :analysis-window-size .020
  :noise-threshold 40.)

(define-attribute "Vers Zero Crossing Rate" zero-crossing-rate-flavor
  nil nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 400.
  :analysis-window-size .020
  :noise-threshold 40.)

(define-attribute "LPC Predictor Coefficients" lpc-flavor (indexed-attribute-window)
  (("LPC Gain Term" (:gain) sampled-attribute-window))
  :waveform-attribute-name "Original Waveform"
  :filter-spec (:bandwidth 78.)
  :analysis-rate 200.)

(define-attribute "LPC Spectrum" lpc-spectral-flavor
  nil nil
  :predictor-attribute-name "LPC Predictor Coefficients"
  :number-of-points 256.)

(define-attribute "LPC Spectrum Slice" spectrum-slice-flavor
  (spectral-slice-attribute-window) nil
  :spectrum-name "LPC Spectrum"
  :cursor-name :cursor-time)

(define-attribute "LPC Spectrum Slice (marker)" spectrum-slice-flavor
  (spectral-slice-attribute-window) nil
  :spectrum-name "LPC Spectrum"
  :cursor-name :marker-time)

(define-attribute "Energy -- 0 Hz to 5000 Hz" energy-from-waveform-flavor
  (sampled-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 200.  ; 383.
  :filter-type :hamming
  :filter-spec (:bandwidth 78.0)
  :number-of-points 256.
  :preemphasis? :default
  :freq-lo-bound 0
  :freq-hi-bound 5000.)

(define-attribute "Vers Total Energy" energy-from-waveform-flavor
  nil nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 400.
  :filter-type :hamming
  :filter-spec (:bandwidth 78.0)
  :number-of-points 128.
  :preemphasis? :default
  :freq-lo-bound nil
  :freq-hi-bound nil)

(define-attribute "Total Energy" energy-from-waveform-flavor
  (sampled-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 200.
  :filter-type :hamming
  :filter-spec (:bandwidth 78.0)
  :number-of-points 128.
  :preemphasis? :default
  :freq-lo-bound nil
  :freq-hi-bound nil)

(define-attribute "Energy -- 120 Hz to 440 Hz" energy-from-waveform-flavor
  (sampled-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 200.
  :filter-type :hamming
  :filter-spec (:bandwidth 78.0)
  :number-of-points 256.
  :preemphasis? :default
  :freq-lo-bound 120.
  :freq-hi-bound 440.)

(define-attribute "Vers Energy -- 125 Hz to 750 Hz" energy-from-waveform-flavor
  nil nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 400.
  :filter-type :hamming
  :filter-spec (:bandwidth 78.0)
  :number-of-points 128.
  :preemphasis? :default
  :freq-lo-bound 125.
  :freq-hi-bound 750.)

(define-attribute "Energy -- 125 Hz to 750 Hz" energy-from-waveform-flavor
  (sampled-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 200.
  :filter-type :hamming
  :filter-spec (:bandwidth 78.0)
  :number-of-points 128.
  :preemphasis? :default
  :freq-lo-bound 125.
  :freq-hi-bound 750.)

(define-attribute "Energy -- 640 Hz to 2800 Hz" energy-from-waveform-flavor
  (sampled-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 200.
  :filter-type :hamming
  :filter-spec (:bandwidth 78.0)
  :number-of-points 256.
  :preemphasis? :default
  :freq-lo-bound 640.
  :freq-hi-bound 2800.)

(define-attribute "Energy -- 3400 Hz to 5000 Hz" energy-from-waveform-flavor
  (sampled-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 200.
  :filter-type :hamming
  :filter-spec (:bandwidth 78.0)
  :number-of-points 256.
  :preemphasis? :default
  :freq-lo-bound 3400.
  :freq-hi-bound 5000.)

(define-attribute "Frication Frequency" energy-percentile-flavor
  (sampled-attribute-window) nil
  :spectral-attribute-name "LPC Spectrum"
  :energy-fraction .25)

(define-attribute "LPC Center of Gravity" energy-mean-flavor
  (sampled-attribute-window) nil
  :spectral-attribute-name "LPC Spectrum")

(define-attribute "Formants" spectral-peaks-flavor
  (indexed-attribute-window) nil
  :spectral-attribute-name "LPC Spectrum"
  :number-of-peaks 4)


(define-attribute "Narrow-Band Spectrum" fft-spectral-flavor
  nil nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 200.
  :filter-type :hamming
  :filter-spec (:bandwidth 78.0)
  :number-of-points 256.)

(define-attribute "Narrow-Band Spectrum Slice" spectrum-slice-flavor
  (spectral-slice-attribute-window) nil
  :spectrum-name "Narrow-Band Spectrum"
  :cursor-name :cursor-time)

(define-attribute "Narrow-Band Spectrum Slice (marker)" spectrum-slice-flavor
  (spectral-slice-attribute-window) nil
  :spectrum-name "Narrow-Band Spectrum"
  :cursor-name :marker-time)

(define-attribute "Narrow-Band Spectral Slice" fft-spectral-slice-flavor
  (spectral-slice-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :cursor-name :cursor-time
  :filter-type :hamming
  :filter-spec (:bandwidth 78.0)
  :number-of-points 256.)

(define-attribute "Narrow-Band Spectral Slice (marker)" fft-spectral-slice-flavor
  (spectral-slice-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :cursor-name :marker-time
  :filter-type :hamming
  :filter-spec (:bandwidth 78.0)
  :number-of-points 256.)

(define-attribute "Wide-Band Spectrum" fft-spectral-flavor
  nil nil
  :waveform-attribute-name "Original Waveform"
  :analysis-rate 200.
  :filter-type :hamming
  :filter-spec (:bandwidth 300.0)
  :number-of-points 256.)

(define-attribute "Wide-Band Spectrum Slice" spectrum-slice-flavor
  (spectral-slice-attribute-window) nil
  :spectrum-name "Wide-Band Spectrum"
  :cursor-name :cursor-time)

(define-attribute "Wide-Band Spectrum Slice (marker)" spectrum-slice-flavor
  (spectral-slice-attribute-window) nil
  :spectrum-name "Wide-Band Spectrum"
  :cursor-name :marker-time)

(define-attribute "Wide-Band Spectral Slice" fft-spectral-slice-flavor
  (spectral-slice-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :cursor-name :cursor-time
  :filter-type :hamming
  :filter-spec (:bandwidth 300.0)
  :number-of-points 256.)

(define-attribute "Wide-Band Spectral Slice (marker)" fft-spectral-slice-flavor
  (spectral-slice-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :cursor-name :marker-time
  :filter-type :hamming
  :filter-spec (:bandwidth 300.0)
  :number-of-points 256.)

(define-attribute "Narrow-Band Spectrogram" stretched-fft-spectrogram-flavor
  (spectrogram-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :spectrogram-size 320.
  :analysis-rate 383.
  :filter-type :hamming
  :filter-spec (:bandwidth 76.0)
  :number-of-points 256.
  :white-value -96.   ; -36
  :black-value -80.)  ; -20

(define-attribute "Wide-Band Spectrogram" stretched-fft-spectrogram-flavor
  (spectrogram-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :spectrogram-size 320.
  :analysis-rate 383.
  :filter-type :hamming
  :filter-spec (:bandwidth 300.0)
  :number-of-points 256.
  :white-value -96.   ; -36
  :black-value -80.)  ; -20

(define-attribute "Versatec Spectrogram" stretched-fft-spectrogram-flavor
  (spectrogram-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :spectrogram-size 840.
  :analysis-rate 1000.
  :filter-type :hamming
  :filter-spec (:bandwidth 400.0)
  :number-of-points 128.
  :white-value -100.  ; -36
  :black-value -75.)  ; -20

(define-attribute "New Narrow-Band Spectrogram" stretched-fft-spectrogram-flavor
  (spectrogram-attribute-window) nil
  :waveform-attribute-name "Original Waveform"
  :spectrogram-size 840.          ; try 840
  :analysis-rate 1000.            ; try 1000
  :filter-type :hamming
  :filter-spec (:bandwidth 76.0)  ; try 400
  :number-of-points 128.
  :white-value -100.              ; -36 try -102
  :black-value -75.)


(define-attribute "Phonetic Transcription" hand-transcription-flavor
  (transcription-attribute-window)
  (("New Phonetic Transcription"
    (:values) token-attribute-window :x-scale 383.0
    :string-font fonts:ipa12))
  :untranscribed-string "<'ntr?nskrYbd>"
  :string-font fonts:ipa12)

(define-attribute "Orthographic Transcription" hand-transcription-flavor
  (transcription-attribute-window)
  (("New Orthographic Transcription"
    (:values) token-attribute-window :x-scale 383.0))
  :token-separator #\space
  :untranscribed-string "<untranscribed>"
  :string-font fonts:hl12b)


(define-attribute "First Formant" formant-flavor
  (sampled-attribute-window) nil
  :index 1
  :indexed-attribute-name "Formants")

(define-attribute "Second Formant" formant-flavor
  (sampled-attribute-window) nil
  :index 2
  :indexed-attribute-name "Formants")

(define-attribute "Third Formant" formant-flavor
  (sampled-attribute-window) nil
  :index 3
  :indexed-attribute-name "Formants")

(define-attribute "Fourth Formant" formant-flavor
  (sampled-attribute-window) nil
  :index 4
  :indexed-attribute-name "Formants")

(define-attribute "Fundamental Frequency" pitch-flavor
  nil (sampled-attribute-window)
  :voicing-attribute-name "Voicing")

(define-attribute "Voicing" voicing-flavor
  nil (sampled-attribute-window)
  :analysis-rate 100.
  :waveform-attribute-name "Original Waveform")



;;; -*- Mode: LISP; Base: 10; Syntax: Zetalisp -*-

;;; This file contains the necessary function to compute the dynamic
;;; time warp array, given feature arrays from the template and the utterance.

;;; "TIMEWARP"
;;; This function receives a pair of arrays, determines their dimensionality
;;; and calls TIMEWARP-1D or TIMEWARP-2D accordingly.

(defun timewarp (arrayM arrayN)
  (cond ((= 1 (array-#-dims arrayM)) (timewarp-1d arrayM arrayN))
        ((= 2 (array-#-dims arrayM)) (timewarp-2d arrayM arrayN))))


;;; "TIMEWARP-2D"
;;; This function will compute the Dynamic Time Warp array given the arrays
;;; arrayM and arrayN.  ArrayM is a x-by-M array, and arrayN is a x-by-N array.
;;; x must be the same for both arrayM and arrayN.  This function is meant for
;;; those spire att's that return 2-D arrays such as the "Wide-Band Spectrum"
;;; and "Formants".  The distance measure used is Minkowski 1 or 2:
;;;    distance = ((a0-b0)^2 + (a1-b1)^2 + ... (aM-bN)^2)^(1/2)   or
;;;    distance = abs(a0-b0) + abs(a1-b1) ...
;;; Input:  arrayM, arrayN
;;; Output: A M-by-N DTW array

(defun timewarp-2d (arrayM arrayN)
  (let* ((M (- (array-dimension-n 2 arrayM) 6))
         (N (- (array-dimension-n 2 arrayN) 6))
         (length (cond ((= (array-dimension-n 1 arrayM) 5)
                        2)
                       ((= (array-dimension-n 1 arrayM) 16)
                        16)
                       ((= (array-dimension-n 1 arrayM) 19)
                        19)
                       (t
                        (princ "Timewarp ERROR.  Hit Control-Abort")
                        (do ((x 0))
                            ((= x 1))))))
         (start (cond ((= (array-dimension-n 1 arrayM) 5)
                       1)
                      ((= (array-dimension-n 1 arrayM) 16)
                       0)
                      ((= (array-dimension-n 1 arrayM) 19)
                       0)))
         (distance 0)
         (result-array (make-array (list M N))))
    (loop for n-index from 0 below N do
      (loop for m-index from 0 below M do
        (setq distance 0.0)
        (loop for v-index from start below (+ start length) do
          (setq distance (+ distance (abs (- (aref arrayM v-index m-index)
                                             (aref arrayN v-index n-index))))))
        (aset distance result-array m-index n-index)))
    result-array))


;;; "TIMEWARP-1D"
;;; This function computes a Dynamic Time Warp array given vectorM and vectorN.  In
;;; this case the distance = abs(a-b) for each a and b in vectorM and vectorN.
;;; Input:  vectorM, vectorN
;;; Output: a M-by-N DTW array

(defun timewarp-1d (vectorM vectorN)
  (let* ((M (- (array-dimension-n 1 vectorM) 5))
         (N (- (array-dimension-n 1 vectorN) 5))
         (return-array (make-array (list M N))))
    (do* ((n-index 0 (1+ n-index)))
         ((= n-index N))
      (do* ((m-index 0 (1+ m-index)))
           ((= m-index M))
        (aset (abs (- (aref vectorM m-index) (aref vectorN n-index)))
              return-array m-index n-index)))
    return-array))


;;; "PRINT-DTW"
;;; This function will show the Dynamic Time Warp array.  This function is
;;; really intended for testing/debugging purposes.
;;; This function will print a section of a 2-D array beginning at (a,b).

(defun print-dtw (array a b)
  (clearscreen)
  (do ((i (+ a 40) (1- i)))
      ((= i (1- a)))
    (do ((j b (1+ j)))
        ((= j (+ b 14)))
      (princ (format nil "~2,1,8,' $" (aref array i j))))
    (terpri)))


;;; "DRAWBORDER"
;;; This function will draw a border on the selected window.

(defun drawborder (x1 y1 x2 y2)
  (send tv:selected-window :draw-line x1 y1 x2 y1)
  (send tv:selected-window :draw-line x1 y1 x1 y2)
  (send tv:selected-window :draw-line x1 y2 x2 y2)
  (send tv:selected-window :draw-line x2 y2 x2 y1))


;;; "PLOT-COMPOSITE-DTW"

(defun plot-composite-dtw (array-list threshold &OPTIONAL search-list tempath uttpath title)
  (let* ((radius 0)
         (total-M (apply '+ (mapcar 'array-dimension-n (circular-list 1) array-list)))
         (total-N (array-dimension-n 2 (car array-list)))
         (x1 400)
         (y1 45)
         (y2 (+ y1 (min 595 total-M)))
         (yrange (- y2 y1))
         (x2 (fix (+ x1 (* 2 (* total-N (// (float yrange) total-M))))))
         (xrange (- x2 x1)))
    (clearscreen)
    (drawborder x1 y1 x2 y2)
    (send tv:selected-window :draw-string title x2 (- y1 4) 0 (- y1 4) nil fonts:tr12b)
    (do* ((a-list array-list (cdr a-list))
          (k 0 (1+ k))
          (array (car a-list) (car a-list))
          (bottom y2 (- bottom current-yrange))
          (v-word (car *vocabulary*) (cond ((not (null a-list))
                                            (nth k *vocabulary*))
                                           (t nil)))
          (current-M (array-dimension-n 1 array) (cond ((not (null a-list))
                                                        (array-dimension-n 1 array))
                                                       (t 0)))
          (current-N (array-dimension-n 2 array) (cond ((not (null a-list))
                                                        (array-dimension-n 2 array))
                                                       (t 0)))
          (current-yrange (* yrange (// (float current-M) total-M))
                          (* yrange (// (float current-M) total-M))))
         ((null a-list))
      (send tv:selected-window :draw-line (- x1 70) (round bottom) (+ 10 x2) (round bottom))
      (display-waveform-rot (- x1 50) (round (- bottom current-yrange))
                            (1- x1) (round bottom)
                            (string-append tempath v-word ".utt"))
      (send tv:selected-window :draw-string
            (nth k *vo-list*)
            (- x1 60)
            (+ (round (- bottom (// current-yrange 2))) 6)
            0
            (+ (round (- bottom (// current-yrange 2))) 6)
            nil
            fonts:bigfnt)
      (do ((m-index 0 (1+ m-index)))
          ((= m-index current-M))
        (do ((n-index 0 (1+ n-index)))
            ((= n-index current-N))
          (setq radius (cond ((< (aref array m-index n-index) threshold) 0)
                             (t -1)))
          (cond ((= radius -1) nil)
                (t (send tv:selected-window :draw-point
                         (round (+ x1 (* n-index (// (float xrange) current-N))))
                         (round (- bottom (* m-index (// current-yrange current-M))))))))))
    (do ((n-index 0 (+ n-index 10)))
        ((> n-index total-N))
      (send tv:selected-window :draw-line
            (round (+ x1 (* n-index (// (float xrange) total-N))))
            (- y2 4)
            (round (+ x1 (* n-index (// (float xrange) total-N))))
            (+ 5 y2)))
    (do ((m-index 0 (+ m-index 10)))
        ((> m-index total-M))
      (send tv:selected-window :draw-line
            (- x1 5)
            (round (- y2 (* m-index (// (float yrange) total-M))))
            (+ x1 5)
            (round (- y2 (* m-index (// (float yrange) total-M))))))
    (cond ((not search-list))
          (t (loop for word in search-list do
               (send tv:selected-window :draw-line
                     (round (+ x1 (* (nth 1 word) (// (float xrange) total-N))))
                     y1
                     (round (+ x1 (* (nth 1 word) (// (float xrange) total-N))))
                     (+ y2 70))
               (send tv:selected-window :draw-string
                     (nth (car word) *vo-list*)
                     (- (round (+ x1
                                  (* (// (+ (nth 1 word) (nth 2 word)) 2)
                                     (// (float xrange) total-N)))) 10)
                     (+ 70 y2)
                     (- (round (+ x1
                                  (* (// (+ (nth 1 word) (nth 2 word)) 2)
                                     (// (float xrange) total-N)))) 10)
                     (+ 70 y2)
                     nil
                     fonts:bigfnt))
             (send tv:selected-window :draw-line
                   x2 y2 x2 (+ y2 70))))
    (cond ((not uttpath))
          (t (display-waveform x1 (1+ y2) x2 (+ y2 50) uttpath)))))


;;; "PLOT-DTW"

(defun plot-dtw (array path1 path2 threshold)
  (let* ((radius 0)
         (M (array-dimension-n 1 array))
         (N (array-dimension-n 2 array))
         (x1 200)
         (x2 900)
         (squish-factor (// (- x2 x1) (float N)))
         (y1 150)
         (y2 (fix (+ (* M squish-factor) y1)))
         (note (prompt-and-read :string "Distance Array Name? ")))
    (clearscreen)
    (display-waveform x1 (1+ y2) x2 (+ y2 100) path2)
    (display-waveform-rot (- x1 100) y1 (1- x1) y2 path1)
    (distribution x1 (+ y2 120) x2 (+ y2 320) (list array) 50)
    (drawborder x1 y1 x2 y2)
    (do ((m-index 0 (1+ m-index)))
        ((= m-index M))
      (do ((n-index 0 (1+ n-index)))
          ((= n-index N))
        (setq radius (cond ((< (aref array m-index n-index) threshold) 2)
                           ((< (aref array m-index n-index) (* 1.5 threshold)) 1)
                           ((< (aref array m-index n-index) (* 1.75 threshold)) 0)
                           (t -1)))
        (cond ((= radius -1) nil)
              (t (send tv:selected-window :draw-filled-in-circle
                       (fix (+ x1 (* n-index (// (- x2 x1) (float N)))))
                       (fix (- y2 (* m-index (// (- y2 y1) (float M)))))
                       radius)))))
    (send tv:selected-window :draw-string note x2 (- y1 5) x1 (- y1 5) nil fonts:tr12b)
    (send tv:selected-window :draw-string
          "Test Pattern"
          (- x2 10) (+ y2 25) 0 (+ y2 25) nil fonts:tr12b)
    (send tv:selected-window :draw-string
          "Template"
          (- x1 15) (- y1 5) 0 (- y1 5) nil fonts:tr12b)
    (do ((n-index 0 (+ n-index 10)))
        ((> n-index N))
      (send tv:selected-window :draw-line
            (fix (+ x1 (* n-index (// (- x2 x1) (float N)))))
            (- y2 5)
            (fix (+ x1 (* n-index (// (- x2 x1) (float N)))))
            (+ y2 6)))
    (do ((m-index 0 (+ m-index 10)))
        ((> m-index M))
      (send tv:selected-window :draw-line
            (- x1 6)
            (fix (- y2 (* m-index (// (- y2 y1) (float M)))))
            (+ x1 5)
            (fix (- y2 (* m-index (// (- y2 y1) (float M)))))))))


;;; "COMBINE-DTW"
;;; This function will weight and combine two or more dtw-arrays.
;;; Input:  dtwlist    => a list of dtw's to combine
;;;         weightlist => list of weight factors to apply to dtwlist
;;; Output: new dtw-array

(defun combine-dtw (dtwlist weightlist)
  (let* ((m-dimension (array-dimension-n 1 (car dtwlist)))
         (n-dimension (array-dimension-n 2 (car dtwlist)))
         (return-dtw (make-array (list m-dimension n-dimension)))
         (sum 0))
    (do ((m 0 (1+ m)))
        ((= m m-dimension))
      (do ((n 0 (1+ n)))
          ((= n n-dimension))
        (setq sum 0.0)
        (do* ((dtw-index dtwlist (cdr dtw-index))
              (dtw-array (car dtw-index) (car dtw-index))
              (weight-index weightlist (cdr weight-index))
              (weight-value (car weight-index) (car weight-index)))
             ((null dtw-index))
          ;; scale each dtw entry by the weight paired with that dtw
          (setq sum (+ sum (* weight-value (aref dtw-array m n)))))
        (aset sum return-dtw m n)))
    return-dtw))


;;; "MAKE-DTW"
;;; This routine computes a Dynamic Time Warp Array given the pathnames of
;;; two utterances and a Spire attribute name (ex. "Formants").
;;; The order in which the pathnames are passed is significant, i.e.,
;;; when plotted the first pathname will run along the vertical axis, and
;;; the second pathname will run across the horizontal axis.  When matching
;;; individual word utterances against continuous speech utterances, it is best
;;; to pass the individual word pathname first.
;;; Input: pathname1, pathname2, spire attribute
;;; Example Call: (make-dtw ">dawson>three>" ">dawson>phone-no" "Wide-Band Spectrum")
;;; Returns: a two dimensional array.  The number of columns (width) is
;;; proportional to the length of pathname2.  The number of rows (height) is
;;; proportional to the length of pathname1.

(defun make-dtw (path1 path2 att)
  (let* ((a (cond ((equal att "Wide-Band Spectrum") (column-normalize-array
                                                      (frequency-compress-lfe
                                                        (compute-att path1 att))))
                  ((equal att "LPC Spectrum") (column-normalize-array
                                                (frequency-compress-lfe
                                                  (compute-att path1 att))))
                  ((equal att "Narrow-Band Spectrum") (column-normalize-array
                                                        (frequency-compress-lfe
                                                          (compute-att path1 att))))
                  ((equal att "Formants") (regionize
                                            (median-filter
                                              (compute-att path1 att))))
                  ((equal att "zero crossing rate") (vector-energy-normalize
                                                      (compute-att path1 att)))
                  (t
                   (compute-att path1 att))))
         (b (cond ((equal att "Wide-Band Spectrum") (column-normalize-array
                                                      (frequency-compress-lfe
                                                        (compute-att path2 att))))
                  ((equal att "LPC Spectrum") (column-normalize-array
                                                (frequency-compress-lfe
                                                  (compute-att path2 att))))
                  ((equal att "Narrow-Band Spectrum") (column-normalize-array
                                                        (frequency-compress-lfe
                                                          (compute-att path2 att))))
                  ((equal att "Formants") (regionize
                                            (median-filter
                                              (compute-att path2 att))))
                  ((equal att "zero crossing rate") (vector-energy-normalize
                                                      (compute-att path2 att)))
                  (t
                   (compute-att path2 att))))
         (return-array (timewarp a b)))
    return-array))


;;; "NEW-READY-DTW-LPC-FORMANTS"

(defun new-ready-dtw-lpc-formants (template utterance)
  (let* ((dtw-list (list (timewarp (car template) (car utterance))
                         (timewarp (cadr template) (cadr utterance))))
         (m-dimension (array-dimension-n 1 (car dtw-list)))
         (n-dimension (array-dimension-n 2 (car dtw-list)))
         (return-dtw (make-array (array-dimensions (car dtw-list)) :type 'art-16b)))
    (loop for m from 0 below m-dimension do
      (loop for n from 0 below n-dimension do
        (let ((frfrt (aref (caddr template) m))
              (frfru (aref (caddr utterance) n))
              (t-region (aref (cadr template) m))
              (u-region (aref (cadr utterance) n))
              (distance (* 1000 (car *weight-list*) (aref (car dtw-list) m n))))
          (cond ((or (> frfrt 1500)
                     (> frfru 1500)
                     (= t-region 0)
                     (not (= t-region u-region)))
                 (aset (fix distance) return-dtw m n))
                (t (aset (fix (* 0.4 distance)) return-dtw m n))))))
    return-dtw))


"READY-DTW-LPC-FORMANTS-FF" 

(defun ready-dtw-lpc-formants-ff (template utterance)
  (let* ((dtw-list (list (timewarp (car template) (car utterance))
                         (timewarp (cadr template) (cadr utterance))))
         (m-dimension (array-dimension-n 1 (car dtw-list)))
         (n-dimension (array-dimension-n 2 (car dtw-list)))
         (return-dtw (make-array (array-dimensions (car dtw-list)) :type 'art-16b))
         (sum 0.0))
    (do ((m 0 (1+ m)))
        ((= m m-dimension))
      (do ((n 0 (1+ n)))
          ((= n n-dimension))
        (setq sum 0.0)
        (cond ((and (< (aref (caddr template) m) 1700)
                    (< (aref (caddr utterance) n) 1700)
                    (< (aref (cadr template) 1 m) 750)
                    (< (aref (cadr utterance) 1 n) 750)
                    (< (aref (cadr template) 2 m) 2200)
                    (< (aref (cadr utterance) 2 n) 2200))
               (setq sum (* 1000 (cadr *weight-list*) (aref (cadr dtw-list) m n))))
              (t
               (setq sum (* 1000 (car *weight-list*) (aref (car dtw-list) m n)))))
        (cond ((< (fix sum) 65535)
               (aset (fix sum) return-dtw m n))
              (t (princ "Overflow")))))
    return-dtw))


;;; "READY-DTW"

;;; This function computes a combined dtw from a couple of lists of feature arrays
;;; and returns that combined dtw array.  It receives as input two lists of feature
;;; arrays.  It then calls TIMEWARP to do the Dynamic Time Warps and then calls
;;; COMBINE-DTW to average together the individual dtw's into one dtw.  Remember
;;; the feature arrays have already been computed by PROCESS-UTTERANCE.

(defun ready-dtw (template utterance)
  (let* ((dtw-list (mapcar 'timewarp template utterance))
         (m-dimension (array-dimension-n 1 (car dtw-list)))
         (n-dimension (array-dimension-n 2 (car dtw-list)))
         (return-dtw (make-array (array-dimensions (car dtw-list)) :type 'art-16b))
         (sum 0.0))
    (do ((m 0 (1+ m)))
        ((= m m-dimension))
      (do ((n 0 (1+ n)))
          ((= n n-dimension))
        (setq sum 0.0)
        (loop for dtw in dtw-list
              for weight in *weight-list* do
          (setq sum (+ sum (* 1000 weight (aref dtw m n)))))
        (cond ((< (fix sum) 65535)
               (aset (fix sum) return-dtw m n))
              (t (princ "Overflow")))))
    return-dtw))
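
The weighted combination that READY-DTW performs cell by cell can be sketched outside the Lisp Machine environment. The block below is a minimal Python illustration, not the thesis code: the name `combine_dtw`, the list-of-lists matrices, and the `limit` parameter are assumptions made for the example. Only the scale factor of 1000, the per-feature weights, and the 16-bit overflow check are taken from the listing.

```python
def combine_dtw(dtw_list, weights, scale=1000, limit=65535):
    """Weighted sum of equally shaped local-distance matrices.

    dtw_list -- list of 2-D lists, one matrix per feature (hypothetical layout)
    weights  -- one weight per feature, playing the role of *weight-list*
    Returns (combined, overflowed): cells whose scaled sum does not fit
    below `limit` are left at 0 and their indices are reported, which
    mirrors the "Overflow" branch of the listing.
    """
    rows, cols = len(dtw_list[0]), len(dtw_list[0][0])
    combined = [[0] * cols for _ in range(rows)]
    overflowed = []
    for m in range(rows):
        for n in range(cols):
            total = sum(scale * w * d[m][n] for d, w in zip(dtw_list, weights))
            if int(total) < limit:
                combined[m][n] = int(total)   # same truncation as FIX for values >= 0
            else:
                overflowed.append((m, n))
    return combined, overflowed
```

Collecting the overflowed cells, instead of only printing a message, is a choice made here so that a caller can decide how to treat them.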


;;; "COMPUTE-COMPOSITE-DTW"

;;; This function computes a composite dtw array between a Ready-Template
;;; and a Ready-Utterance.  In other words, dtw's (Dynamic Time Warps) are performed
;;; between the utterance and each word of the vocabulary.  The separate dtw arrays
;;; are put in a list to form one composite array.

;;; Input:  None.  *t-set* and *ready-utterance* are used.
;;; Output: composite dtw

(defun compute-composite-dtw ()
  (let ((result-list nil))
    (princ "Count-Down: ")
    (loop for template in *t-set*
          for count from (length *t-set*) downto 0 do
      (princ (format nil "~D-" count))
      (setq result-list (append result-list
                                (list (new-ready-dtw-lpc-formants
                                        template *ready-utterance*)))))
    (terpri)
    result-list))

;;; "old" ==> (mapcar 'ready-dtw *t-set* (circular-list *ready-utterance*))


;;; "DISTRIBUTION"

;;; This function takes a composite dtw array and computes the distribution
;;; of its values.  The second argument specifies the number of bars to
;;; be drawn.

(defun distribution (x1 y1 x2 y2 cdtw res)
  (let* ((mean 0.0)
         (min +1e∞)
         (max -1e∞)
         (sum 0.0)
         (sum-sq 0.0)
         (vari 0.0)
         (num 0)
         (pdf (make-array res ':type art-16b ':initial-value 0))
         (width (fix (// (- x2 x1) res)))
         (space (fix (// width 3)))
         (bar (- width space))
         (pdf-max -1e∞)
         (title (prompt-and-read :string "Title? ")))
    (drawborder x1 y1 x2 y2)
    (send tv:selected-window :draw-string title x2 (- y1 5) 0 (- y1 5) nil fonts:tr12b)
    (loop for dtw in cdtw do
      (loop for i (fixnum) from 0 below (array-dimension-n 1 dtw) do
        (loop for j (fixnum) from 0 below (array-dimension-n 2 dtw) do
          (cond ((< (aref dtw i j) min)
                 (setq min (aref dtw i j)))
                ((> (aref dtw i j) max)
                 (setq max (aref dtw i j))))
          (setq sum (+ sum (aref dtw i j)))
          (setq sum-sq (+ sum-sq (sqr (aref dtw i j))))
          (setq num (1+ num)))))
    (setq mean (// sum num))
    (setq vari (// (- (* num sum-sq) (sqr sum)) (* num (1- num))))
    (loop for dtw in cdtw do
      (loop for i (fixnum) from 0 below (array-dimension-n 1 dtw) do
        (loop for j (fixnum) from 0 below (array-dimension-n 2 dtw) do
          (setq num (fix (* (- (aref dtw i j) min)
                            (// (1- (array-dimension-n 1 pdf)) (float (- max min))))))
          (aset (1+ (aref pdf num)) pdf num))))
    (loop for i from 0 below (array-dimension-n 1 pdf) do
      (cond ((> (aref pdf i) pdf-max)
             (setq pdf-max (aref pdf i)))))
    (loop for i from 0 below (array-dimension-n 1 pdf) do
      (send tv:selected-window :draw-rectangle
            bar
            (fix (* (aref pdf i) (// (float (- y2 y1 20)) pdf-max)))
            (+ 1 x1 space (* i width))
            (fix (- y2 (* (aref pdf i) (// (float (- y2 y1 20)) pdf-max))))))
    (send tv:selected-window :draw-string
          (format nil "Mean = ~D" mean) x1 (+ y1 15) x2 (+ y1 15) nil fonts:tr12b)
    (send tv:selected-window :draw-string
          (format nil "Min = ~D" min) x1 (+ y1 30) x2 (+ y1 30) nil fonts:tr12b)
    (send tv:selected-window :draw-string
          (format nil "Max = ~D" max) x1 (+ y1 45) x2 (+ y1 45) nil fonts:tr12b)
    (send tv:selected-window :draw-string
          (format nil "Var = ~D" vari) x1 (+ y1 60) x2 (+ y1 60) nil fonts:tr12b)))
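
The statistics gathered by DISTRIBUTION (mean, minimum, maximum, sample variance, and a histogram with a fixed number of bars) can be reproduced without the window-system calls. The following Python sketch is an illustration only: the name `summarize` and the flat list input are assumptions, and it expects at least two values that are not all equal. The variance formula and the bin index computation follow the listing.

```python
def summarize(values, bins):
    """Return (mean, lowest, highest, variance, counts) for a flat list of numbers.

    Assumes len(values) > 1 and max(values) != min(values); both conditions
    are also needed by the arithmetic in the listing.
    """
    n = len(values)
    total = sum(values)
    total_sq = sum(v * v for v in values)
    lowest, highest = min(values), max(values)
    mean = total / n
    # Sample variance in the single-pass form used by the listing.
    variance = (n * total_sq - total ** 2) / (n * (n - 1))
    # Map [lowest, highest] onto bin indices 0 .. bins-1.
    scale = (bins - 1) / float(highest - lowest)
    counts = [0] * bins
    for v in values:
        counts[int((v - lowest) * scale)] += 1
    return mean, lowest, highest, variance, counts
```

Because the scale factor uses `bins - 1`, only the largest value lands in the last bar, as in the original bin computation.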


;;; "MAKE-DTW-LIST"


;;; This function makes repeated calls to "MAKE-DTW" and sets each variable
;;; in variable-list to the dtw for the corresponding item in attribute-list.

(defun make-dtw-list (pathname1 pathname2 variable-list attribute-list)
  (do* ((dtw-list variable-list (cdr dtw-list))
        (dtw-name (car dtw-list) (car dtw-list))
        (att-list attribute-list (cdr att-list))
        (att-name (car att-list) (car att-list)))
       ((null dtw-list))
    (set dtw-name (make-dtw pathname1 pathname2 att-name))))


;;; "SCAN-DTW"


;;; This function scans the composite Dynamic Time Warp Array and
;;; determines what words are contained in the test utterance.  The algorithm
;;; used is the "One-Stage Dynamic Programming Algorithm for Connected Word
;;; Recognition" by Hermann Ney.  See IEEE Transactions ASSP-32 No. 2 April 1984.

(defun scan-dtw (composite-dtw)
  (let* ((title (prompt-and-read :string "Title? "))
         (N (array-dimension-n 2 (car composite-dtw)))
         (D-list (mapcar 'make-array
                         (mapcar 'array-dimension-n (circular-list 1) composite-dtw)))
         (B-list (mapcar 'make-array
                         (mapcar 'array-dimension-n (circular-list 1) composite-dtw)))
         (from-template (make-array N :type 'art-16b))
         (from-frame (make-array N :type 'art-16b))
         (d-min)
         (save-b)
         (save-d)
         (save-temp)
         (a 1.0)
         (b 0.5)
         (return-list)
         (dummy +1e∞))

    (terpri) (print "Computing Accumulated Distance Array")
    (terpri) (princ "Begin Step 1 ... ")
    (loop for current-dtw in composite-dtw                         ; each k
          for current-ada in D-list
          for current-B in B-list do
      (loop for n from 0 below (array-dimension-n 1 current-dtw)   ; n := 0 .. J-1
            sum (aref current-dtw n 0) into local-sum              ; Sum for i=0
            do (aset local-sum current-ada n)                      ; set initial values
               (aset 0 current-B n)))
    (princ "Done.")

    ;;; STEP 2
    (terpri) (princ "Begin Step 2 ... ")
    (loop for i fixnum from 1 below N do
      (setq dummy +1e∞)
      (loop for current-dtw in composite-dtw
            for current-ada in D-list
            for current-B in B-list
            for k from 0 to (length composite-dtw) do
        (setq d-min (min (aref current-ada 0)
                         (apply 'min (mapcar 'aref D-list
                                             (mapcar '1- (mapcar 'array-dimension-n
                                                                 (circular-list 1)
                                                                 D-list))))))
        (cond ((not (= d-min (aref current-ada 0)))
               (aset (+ i 1) current-B 0)))
        (setq save-d (aref current-ada 0))
        (setq save-b (aref current-B 0))
        (aset (+ (aref current-dtw 0 i) d-min) current-ada 0)
        (loop for j fixnum from 1 below (array-dimension-n 1 current-ada) do
          (setq d-min (min (+ (* (1+ a) (aref current-dtw j i))
                              (aref current-ada j))              ;list of
                           (+ (aref current-dtw j i) save-d)     ;possible
                           (+ (* b (aref current-dtw (1- j) i))
                              (aref current-ada (1- j)))))       ;predecessors
          (setq save-temp (aref current-B j))
          (cond ((= d-min (+ (aref current-dtw j i) save-d))     ;Update
                 (aset save-b current-B j))                      ;Backpointer
                ((= d-min (+ (* b (aref current-dtw (1- j) i))
                             (aref current-ada (1- j))))         ;Array
                 (aset (aref current-B (1- j)) current-B j)))
          (setq save-d (aref current-ada j))                     ;save diagonal
          (setq save-b save-temp)                                ;predecessor and
          (aset d-min current-ada j))                            ;and Backpointer
        ;Update "From Template"
        ;Array T[i]
        ;and "From Frame"
        ;Array F[i]
        (cond ((< (aref current-ada (1- (array-dimension-n 1 current-ada))) dummy)
               (setq dummy (aref current-ada (1- (array-dimension-n 1 current-ada))))
               (aset k from-template i)
               (aset (aref current-B (1- (array-dimension-n 1 current-B)))
                     from-frame i)))))
    (terpri) (princ "Done.")

    ;;; STEP 3
    (terpri) (princ "Begin Step 3 ...")
    (loop for i from (1- N) downto 0 do
      (princ (format nil "~D" (aref from-template i))))
    (terpri)
    (setq return-list
          (do* ((word-end (1- N) pred)
                (word (aref from-template (1- N)) (aref from-template pred))
                (pred (aref from-frame (1- N)) (aref from-frame pred))
                (answer (list word) (append (list word) answer))
                (boundry-list (list (list word pred word-end))
                              (append (list (list word pred word-end)) boundry-list)))
               ((<= pred 1) boundry-list)))
    (plot-composite-dtw composite-dtw
                        (* *thresh* (length *weight-list*))
                        return-list
                        *tempath*
                        *uttpath*
                        title)))
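
The three steps of SCAN-DTW (initialise the first input frame, accumulate distances while carrying backpointers, then trace back the word sequence) follow the one-stage algorithm cited above. A compact sketch is given here in Python, rather than Zetalisp, so that it can be run today. It is an illustration under stated assumptions, not the thesis implementation: the name `one_stage_dp`, the nested-list layout `dists[k][j][i]`, and the unweighted three-predecessor local constraint are choices made for the example, and the slope weights `a` and `b` of the listing are not reproduced.

```python
def one_stage_dp(dists):
    """Connected-word decoding by one-stage dynamic programming (sketch).

    dists[k][j][i] -- local distance between frame j of template k and
                      frame i of the input (hypothetical layout).
    Returns a list of (template index, first input frame, last input frame).
    """
    n_frames = len(dists[0][0])
    # Step 1: at input frame 0 a word can only start at its own frame 0.
    D, B = [], []
    for d in dists:
        column, total = [], 0.0
        for j in range(len(d)):
            total += d[j][0]
            column.append(total)
        D.append(column)
        B.append([-1] * len(d))          # -1 means "no previous word"
    from_template = [0] * n_frames       # best word ending at each input frame
    from_frame = [-1] * n_frames         # last frame of the word before it
    best = min(range(len(dists)), key=lambda k: D[k][-1])
    from_template[0], from_frame[0] = best, B[best][-1]

    # Step 2: accumulate distances frame by frame.
    for i in range(1, n_frames):
        best_end = min(column[-1] for column in D)   # cheapest word ending at i-1
        new_D, new_B = [], []
        for k, d in enumerate(dists):
            old_d, old_b = D[k], B[k]
            # Template frame 0: stay in this word, or start it after best_end.
            if old_d[0] <= best_end:
                column, back = [d[0][i] + old_d[0]], [old_b[0]]
            else:
                column, back = [d[0][i] + best_end], [i - 1]
            for j in range(1, len(d)):
                candidates = [(old_d[j], old_b[j]),          # horizontal
                              (old_d[j - 1], old_b[j - 1]),  # diagonal
                              (column[j - 1], back[j - 1])]  # vertical
                cost, pointer = min(candidates, key=lambda c: c[0])
                column.append(d[j][i] + cost)
                back.append(pointer)
            new_D.append(column)
            new_B.append(back)
        D, B = new_D, new_B
        best = min(range(len(dists)), key=lambda k: D[k][-1])
        from_template[i], from_frame[i] = best, B[best][-1]

    # Step 3: trace back from the last input frame.
    words, end = [], n_frames - 1
    while end >= 0:
        previous_end = from_frame[end]
        words.append((from_template[end], previous_end + 1, end))
        end = previous_end
    words.reverse()
    return words
```

Only one column of accumulated distances and backpointers is kept per template, which is the memory saving that the `save-d` and `save-b` variables achieve in the listing.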


;;; "CREATE-COMPOSITE-DTW-FILE"

;;; This function creates a Composite DTW File from *t-set* and *utterance*

(defun create-composite-dtw-file ()
  (let* ((write-path (string-append
                       "spl:>dawson>thesis>dtw>"
                       (prompt-and-read :string
                                        "Please enter CDTW name to create: "))))
    (setq *cdtw* (compute-composite-dtw))
    (dump-to-disk write-path (list *cdtw* *weight-list* *tempath* *uttpath*))
    (word-search!)))


;;; "LOAD-COMPOSITE-DTW-FILE"

;;; This function prompts for a cdtw file name, loads it and setq's it to *cdtw*

(defun load-composite-dtw-file ()
  (load (string-append
          "spl:>dawson>thesis>dtw>"
          (prompt-and-read :string
                           "Please enter CDTW name to load: ")))
  (setq *cdtw* (car *data*))
  (setq *weight-list* (nth 1 *data*))
  (setq *tempath* (nth 2 *data*))
  (setq *uttpath* (nth 3 *data*))
  (word-search!))


(defun add-template (tempname2)
  (load (string-append ">dawson>thesis>templates>" tempname2))
  (setq *t-set* (append *t-set* (car *data*)))
  (setq *tempath2* (string-append ">dawson>thesis>templates>" (cadr *data*)))
  (setq *vo-list* '("0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
                    "10" "11" "12" "13" "14" "15" "16" "17" "18" "19"))
  (setq *vocabulary* (append *vocabulary* *vocabulary*)))



;;; -*- Mode: LISP; Base: 10; Syntax: Zetalisp -*-
;;; "UTILITIES"

;;; This file contains various utilities used by Word-SEARCH!

;;; "COMPUTE-ATT"

;;; This is a function to get the att values for a
;;; given utterance stored on disk.

;;; Calling Procedure:
;;;   (compute-att utt-name att-name)

;;; Example Usage:
;;;   (setq result-array (compute-att "spl:>dawson>alpha.utt" "LPC Gain Term"))
;;; Note: result-array now contains the result of the att computation.

(defun compute-att (pathname att-name)
  (let ((return-array))
    (terpri)
    (princ "Computing ")
    (princ att-name)
    (princ "...")
    (setq return-array (spire:att-val (send (spire:utterance pathname) :find-att att-name)))
    (princ "Done.")
    return-array))

;;; Note: This leaves the utterance described by pathname loaded until whenever.
;;; In order to kill an utterance (unload is a better term) the following
;;; statement will do the trick:
;;;   (send (spire:utterance pathname) :kill)

;;; "PROCESS-UTTERANCE-LPC"

;;; Function to perform LPC computations on a single utterance.
;;; This function makes repeated calls to "COMPUTE-ATT".

;;; Input  : Full pathname to utterance
;;; Output : List of arrays, i.e., computed features

(defun process-utterance-lpc (pathname)
  (let ((return-list (list (column-normalize-array
                             (frequency-compress-lfe
                               (compute-att
                                 pathname
                                 "LPC Spectrum"))))))
    (setq *weight-list* '(4.5))
    return-list))


;;; "PROCESS-UTTERANCE-NBS"

;;; Function to perform NBS computations on a single utterance.
;;; This function makes repeated calls to "COMPUTE-ATT".

;;; Input  : Full pathname to utterance
;;; Output : List of arrays, i.e., computed features

(defun process-utterance-nbs (pathname)
  (let ((return-list (list (column-normalize-array
                             (frequency-compress-lfe
                               (compute-att
                                 pathname
                                 "Narrow-Band Spectrum"))))))
    (setq *weight-list* '(5.0))
    return-list))


;;; "PROCESS-UTTERANCE-WBS"

;;; Function to perform WBS computations on a single utterance.
;;; This function makes repeated calls to "COMPUTE-ATT".

;;; Input  : Full pathname to utterance
;;; Output : List of arrays, i.e., computed features

(defun process-utterance-wbs (pathname)
  (let ((return-list (list (column-normalize-array
                             (frequency-compress-lfe
                               (compute-att
                                 pathname
                                 "Wide-Band Spectrum"))))))
    (setq *weight-list* '(4.5))
    return-list))


;;; "PROCESS-UTTERANCE-FORMANTS"

;;; Function to perform Formant calculations on a single utterance.

(defun process-utterance-formants (pathname)
  (let ((return-list (list (median-filter
                             (compute-att
                               pathname
                               "Formants")))))
    (setq *weight-list* '(.0016))
    return-list))


;;; "PROCESS-UTTERANCE-ZCR"

(defun process-utterance-zcr (pathname)
  (let ((return-list nil))
    (terpri)
    (setq return-list (list (vector-mag-norm
                              (compute-att
                                pathname
                                "Zero Crossing Rate"))))
    (setq *weight-list* '(0.02))
    return-list))


"PROCESS-UTTERANCE-LPC-FORMANTS” 


;;; Function to perform family of computations on a single utterance.
;;; This function makes repeated calls to "COMPUTE-ATT".

;;; Input  : Full pathname to utterance
;;; Output : List of arrays, i.e., computed features

(defun process-utterance-lpc-formants (pathname)
  (let ((returned-list nil))
    (setq returned-list (list (column-normalize-array
                                (frequency-compress-lfe
                                  (compute-att
                                    pathname
                                    "LPC Spectrum")))))
    (setq returned-list (append returned-list (list (median-filter
                                                      (compute-att
                                                        pathname
                                                        "Formants")))))
    (setq *weight-list* '(2.44 0.0024))
    returned-list))


;;; "PROCESS-UTTERANCE-LPC-FORMANTS-FF"

;;; Processes utterances for LPC Spectrum, Formants, and Frication Frequency.

(defun process-utterance-lpc-formants-ff (pathname)
  (let ((returned-list nil))
    (setq returned-list (list (column-normalize-array
                                (frequency-compress-lfe
                                  (compute-att
                                    pathname
                                    "LPC Spectrum")))))
    (setq returned-list (append returned-list (list (regionize
                                                      (median-filter
                                                        (compute-att
                                                          pathname
                                                          "Formants"))))))
    (setq returned-list (append returned-list (list (compute-att
                                                      pathname
                                                      "Frication Frequency"))))
    (setq *weight-list* '(4.5 2))
    returned-list))


;;; "PROCESS-UTTERANCE-WBS-LPC"

;;; Function to perform family of computations on a single utterance.
;;; This function makes repeated calls to "COMPUTE-ATT".

;;; Input  : Full pathname to utterance
;;; Output : List of arrays, i.e., computed features

(defun process-utterance-wbs-lpc (pathname)
  (let ((returned-list nil))
    (setq returned-list (list (column-normalize-array
                                (frequency-compress-lfe
                                  (compute-att
                                    pathname
                                    "Wide-Band Spectrum")))))
    (setq returned-list (append returned-list (list (column-normalize-array
                                                      (frequency-compress-lfe
                                                        (compute-att
                                                          pathname
                                                          "LPC Spectrum"))))))
    (setq *weight-list* '(3.6 5.0))
    returned-list))


;;; "PROCESS-UTTERANCE-NBS-LPC"

;;; Function to perform family of computations on a single utterance.
;;; This function makes repeated calls to "COMPUTE-ATT".

;;; Input  : Full pathname to utterance
;;; Output : List of arrays, i.e., computed features

(defun process-utterance-nbs-lpc (pathname)
  (let ((returned-list nil))
    (setq returned-list (list (column-normalize-array
                                (frequency-compress-lfe
                                  (compute-att
                                    pathname
                                    "Narrow-Band Spectrum")))))
    (setq returned-list (append returned-list (list (column-normalize-array
                                                      (frequency-compress-lfe
                                                        (compute-att
                                                          pathname
                                                          "LPC Spectrum"))))))
    (setq *weight-list* '(3.6 5.0))
    returned-list))


;;; "COLUMN-NORMALIZE-ARRAY"

(defun column-normalize-array (array)
  (let* ((height (array-dimension-n 1 array))
         (length (array-dimension-n 2 array))
         (total-energy 0)
         (result-array (make-array (list height length) ':initial-value 0)))
    (do ((column 0 (1+ column)))
        ((= column length))
      (setq total-energy 0)
      (do ((row 0 (1+ row)))
          ((= row height))
        (setq total-energy (+ total-energy (sqr (aref array row column)))))
      (setq total-energy (sqrt total-energy))
      (do ((row 0 (1+ row)))
          ((= row height))
        (aset (// (aref array row column) (cond ((= total-energy 0) 1)
                                                (t total-energy)))
              result-array row column)))
    result-array))
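
COLUMN-NORMALIZE-ARRAY scales every column (one time frame) to unit Euclidean length and guards against all-zero frames by dividing by 1 instead. A minimal Python sketch of the same idea follows. The name `column_normalize` and the list-of-lists representation are assumptions made for illustration.

```python
import math

def column_normalize(array):
    """Scale each column of a 2-D list to unit Euclidean length.

    Columns whose energy is zero are divided by 1, so they stay all zero.
    """
    height, length = len(array), len(array[0])
    result = [[0.0] * length for _ in range(height)]
    for col in range(length):
        energy = math.sqrt(sum(array[row][col] ** 2 for row in range(height)))
        divisor = energy if energy != 0 else 1
        for row in range(height):
            result[row][col] = array[row][col] / divisor
    return result
```

Normalizing per frame removes overall loudness differences between speakers before the frames are compared.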


"REGIONIZE" 

;;; This function takes as input Formants and assigns a region for
;;; each point in time according to the first and second formants.
;;; Each region represents a specific vowel sound.

(defun xor (alist)
  (let ((count 0))
    (loop for thing in alist do
      (cond (thing
             (setq count (1+ count)))))
    (oddp count)))

(defun intersect (seg1 seg2)
  (let* ((x11 (nth 0 seg1))
         (y11 (nth 1 seg1))
         (x12 (nth 2 seg1))
         (y12 (nth 3 seg1))
         (x21 (nth 0 seg2))
         (y21 (nth 1 seg2))
         (x22 (nth 2 seg2))
         (y22 (nth 3 seg2))
         (m1 (// (float (- y12 y11)) (- x12 x11)))
         (m2 (// (float (- y22 y21)) (- x22 x21)))
         (x (// (+ y22 (* m1 x12) (- 0 y12 (* m2 x22)))
                (- m1 m2)))
         (t1 (// (- x x11) (- x12 x11)))
         (t2 (// (- x x21) (- x22 x21)))
         (result (cond ((and (<= t1 1.0)
                             (>= t1 0.0)
                             (<= t2 1.0)
                             (>= t2 0.0))
                        T)
                       (T nil))))
    result))


(defun regionize (formants)
  (let* ((f1 0)
         (f2 0)
         (result (make-array (array-dimension-n 2 formants) :type 'art-8b)))
    (loop for time fixnum from 0 below (array-dimension-n 2 formants) do
      (setq f1 (aref formants 1 time))
      (setq f2 (aref formants 2 time))
      ;(terpri) (princ f1) (princ ",") (princ f2) (princ "-")
      (cond ((xor (list (intersect (list f1 f2 1500 f2) '(0 1750 250 3500))
                        (intersect (list f1 f2 1500 f2) '(250 1750 450 3500))))
             ;(princ 1)
             (aset 1 result time))
             ;(aset 300 formants 1 time)
             ;(aset 2750 formants 2 time))
            ((xor (list (intersect (list f1 f2 1500 f2) '(250 1750 450 3500))
                        (intersect (list f1 f2 1500 f2) '(450 1750 700 3500))))
             ;(princ 2)
             (aset 2 result time))
             ;(aset 420 formants 1 time)
             ;(aset 2300 formants 2 time))
            ((xor (list (intersect (list f1 f2 1500 f2) '(450 1750 700 3500))
                        (intersect (list f1 f2 1500 f2) '(900 2500 901 3500))
                        (intersect (list f1 f2 1500 f2) '(600 1750 900 2500))))
             ;(princ 3)
             (aset 3 result time))
             ;(aset 600 formants 1 time)
             ;(aset 2200 formants 2 time))
            ((xor (list (intersect (list f1 f2 1500 f2) '(600 1500 601 1750))
                        (intersect (list f1 f2 1500 f2) '(600 1750 900 2500))
                        (intersect (list f1 f2 1500 f2) '(750 1500 1200 2500))))
             ;(princ 4)
             (aset 4 result time))
             ;(aset 700 formants 1 time)
             ;(aset 1800 formants 2 time))
            ((xor (list (intersect (list f1 f2 1500 f2) '(750 1500 1200 2500))
                        (intersect (list f1 f2 1500 f2) '(600 1100 601 1500))
                        (intersect (list f1 f2 1500 f2) '(650 1100 1200 1750))
                        (intersect (list f1 f2 1500 f2) '(1200 1750 1201 2500))))
             ;(princ 5)
             (aset 5 result time))
             ;(aset 800 formants 1 time)
             ;(aset 1500 formants 2 time))
            ((xor (list (intersect (list f1 f2 1500 f2) '(650 950 651 1100))
                        (intersect (list f1 f2 1500 f2) '(650 1100 1200 1750))
                        (intersect (list f1 f2 1500 f2) '(800 950 1200 1100))
                        (intersect (list f1 f2 1500 f2) '(1200 1100 1201 1750))))
             ;(princ 6)
             (aset 6 result time))
             ;(aset 900 formants 1 time)
             ;(aset 1100 formants 2 time))
            ((xor (list (intersect (list f1 f2 1500 f2) '(350 1300 351 1750))
                        (intersect (list f1 f2 1500 f2) '(600 1300 601 1750))))
             ;(princ 7)
             (aset 7 result time))
             ;(aset 500 formants 1 time)
             ;(aset 1500 formants 2 time))
            ((xor (list (intersect (list f1 f2 1500 f2) '(400 950 401 1300))
                        (intersect (list f1 f2 1500 f2) '(600 950 601 1300))))
             ;(princ 8)
             (aset 8 result time))
             ;(aset 500 formants 1 time)
             ;(aset 1000 formants 2 time))
            ((xor (list (intersect (list f1 f2 1500 f2) '(200 500 201 1300))
                        (intersect (list f1 f2 1500 f2) '(400 500 401 1300))))
             ;(princ 9)
             (aset 9 result time))
             ;(aset 300 formants 1 time)
             ;(aset 900 formants 2 time))
            ((xor (list (intersect (list f1 f2 1500 f2) '(400 500 401 950))
                        (intersect (list f1 f2 1500 f2) '(600 950 601 1100))
                        (intersect (list f1 f2 1500 f2) '(650 950 651 1100))
                        (intersect (list f1 f2 1500 f2) '(600 500 800 950))))
             ;(princ 10)
             (aset 10 result time))
             ;(aset 601 formants 1 time)
             ;(aset 800 formants 2 time))
            (t ;(princ 0)
             (aset 0 result time))))
             ;(aset 0 formants 1 time)
             ;(aset 0 formants 2 time))))
    result))
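
REGIONIZE decides whether a point (F1, F2) lies inside a vowel region by drawing a horizontal segment from the point to F1 = 1500 and counting how many boundary segments it crosses: an odd count means the point is inside. The Python sketch below illustrates that parity test using the same intersection formula as INTERSECT. The names `crosses` and `in_region` are assumptions made for the example. Like the listing, it assumes no segment is exactly vertical and that the two slopes differ, which is presumably why the listing writes near-vertical boundaries such as (600 1500 601 1750).

```python
def crosses(seg1, seg2):
    """True if two segments (x1, y1, x2, y2) intersect.

    Assumes neither segment is vertical and the slopes are different,
    otherwise a division by zero occurs (as it would in the listing).
    """
    x11, y11, x12, y12 = seg1
    x21, y21, x22, y22 = seg2
    m1 = (y12 - y11) / (x12 - x11)
    m2 = (y22 - y21) / (x22 - x21)
    # x coordinate where the two infinite lines meet
    x = (y22 + m1 * x12 - y12 - m2 * x22) / (m1 - m2)
    t1 = (x - x11) / (x12 - x11)   # position along seg1, 0..1 if inside
    t2 = (x - x21) / (x22 - x21)   # position along seg2, 0..1 if inside
    return 0.0 <= t1 <= 1.0 and 0.0 <= t2 <= 1.0

def in_region(f1, f2, boundaries, x_far=1500):
    """Parity test: odd number of boundary crossings means inside."""
    ray = (f1, f2, x_far, f2)
    return sum(crosses(ray, seg) for seg in boundaries) % 2 == 1
```

With the two boundary segments of region 1 from the listing, `in_region(100, 2000, [(0, 1750, 250, 3500), (250, 1750, 450, 3500)])` evaluates to true, while a point to the left of both boundaries crosses two segments and is rejected.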


;;; "MEDIAN-FILTER"

;;; This function median filters the 5 by length formant array returned by SPIRE.  This
;;; is an effort to smooth the formant values to remove the glitches when the formant
;;; tracker loses track.  Note that the (0,i) row has all zero values.

(defun median-filter (array)
  (let* ((rows (array-dimension-n 1 array))
         (columns (array-dimension-n 2 array))
         (return-array (make-array (list rows columns)))
         (window-vector (make-array 11)))
    (copy-array-contents array return-array)
    (do* ((row-index 1 (1+ row-index)))
         ((= row-index rows))
      (do* ((column-index 5 (1+ column-index)))
           ((= column-index (- columns 5)))
        (do* ((window-index (- column-index 5) (1+ window-index))
              (window-vector-index 0 (1+ window-vector-index)))
             ((= window-vector-index 11))
          (aset (aref array row-index window-index) window-vector window-vector-index))
        (aset (aref (sort window-vector '<) 4) return-array row-index column-index)))
    return-array))
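
The sliding-window smoothing done by MEDIAN-FILTER can be sketched as follows. This Python version is an illustration, not the thesis code: the name `median_filter_rows` and its parameters are assumptions. It keeps the 11-point window, skips row 0, and leaves the first and last five columns unfiltered, as the listing does. One difference is deliberate and should be noted: the listing as reproduced here selects element 4 of the sorted window, whereas the sketch selects the middle element (index 5), which is the true median of 11 values.

```python
def median_filter_rows(array, half_width=5, skip_rows=1):
    """Median-filter each row of a 2-D list along the time axis.

    half_width -- points on each side of the centre (window = 2*half_width + 1)
    skip_rows  -- leading rows copied unchanged (row 0 holds only zeros)
    Columns closer than half_width to either end are copied unchanged.
    """
    result = [list(row) for row in array]
    for r in range(skip_rows, len(array)):
        row = array[r]
        for c in range(half_width, len(row) - half_width):
            window = sorted(row[c - half_width:c + half_width + 1])
            result[r][c] = window[half_width]   # middle element of the window
    return result
```

A single-frame spike, such as a formant tracker jumping to a wrong peak for one frame, is removed because it can never occupy the middle of the sorted window.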


"GET-PATHNAME" 

;;;  Function  to  get  a  pathname  from  user 
;;;  providing  prompt  and  default  pathname. 

(defun get-pathname (default)
  (fs:set-default-pathname default)
  (prompt-and-read `(:pathname :visible-default ,fs:*default-pathname-defaults*)
                   "Enter pathname => "))


"show-list" 

(defun  show-list  (alist) 
(loop  for  element  in  alist 
do  (print  element))) 


;;; "DUMP-TO-DISK"

;;; Function to dump data to a disk file.

;;; Input  : Full Path and Filename, thing to dump
;;; Output : Writes a compiled Lisp form to disk
;;;          such that when loaded (like any ordinary lisp form)
;;;          the data is setq'd to, in this case, *data*.

(defun dump-to-disk (pathname data)
  (sys:dump-forms-to-file pathname (list `(setq *data* ',data))))

;;; Note: To read this data, (load pathname).
;;;       The global variable *data* will then contain the data.


;;; Function to square a number

(defun sqr (number) (* number number))


;;; "CLEARSCREEN"

;;; Function to clear the screen.
;;; No arguments required.

(defun clearscreen ()
  (send tv:selected-window :clear-window))



;;; "SUBLIST"

;;; This function takes as input a list and returns a sublist of
;;; elements i thru j.

;;; Example: foo => (a b c d e f)
;;;          (sublist foo 1 3) => (b c d)

(defun sublist (alist i j)
  (let ((return-list (list (nth i alist))))
    (do* ((marker (1+ i) (1+ marker))
          (end (1+ j)))
         ((= marker end))
      (setq return-list (append return-list (list (nth marker alist)))))
    return-list))


;;; "COLUMN-AVERAGE"

;;; This function will average a subcolumn from a column of a 2-D array.
;;; It takes as input the array, the column number, indexes i and j.  It averages
;;; the array elements i thru j of the specified column number.

;;; Input:  2-D array, column, i, j
;;; Output: Average

(defun column-average (array column i j)
  (let ((sum 0.0)
        (end (1+ j)))
    (do ((count i (1+ count)))
        ((= count end))
      (setq sum (+ sum (aref array count column))))
    (// sum (1+ (- j i)))))


;;; "VECTOR-ENERGY-NORMALIZE"


;;; Function to normalize a one dimensional array by energy

;;; Description:
;;;   The total energy of the array is calculated
;;;   by summing the squares of all the elements and
;;;   taking the square root of that sum.
;;;   The normalized array is formed by dividing
;;;   each element of the input array by the total energy.

;;; Input:   one dimensional array
;;; Returns: normalized version of input

(defun vector-energy-normalize (vector)
  (let ((return-array (make-array (array-length vector)))
        (total-energy 0))
    (do ((counter 0 (1+ counter))
         (endmark (array-length vector)))
        ((= counter endmark))
      (setq total-energy (+ total-energy (sqr (aref vector counter)))))
    (setq total-energy (// (sqrt total-energy) (array-length return-array)))
    (do ((counter 0 (1+ counter))
         (endmark (array-length vector)))
        ((= counter endmark))
      (aset (// (aref vector counter) total-energy) return-array counter))
    return-array))


;;; "VECTOR-MAGNITUDE-NORMALIZE"


;;; This function is similar to VECTOR-ENERGY-NORMALIZE except that the
;;; values from the input vector are simply mapped into a range of 0 to 1.
;;; In other words, the smallest value of the input array will be mapped to zero
;;; and the largest value mapped to one; all others will fall somewhere in
;;; between.  This normalization technique arises from the fact that the
;;; VECTOR-ENERGY-NORMALIZATION technique fails for vectors of unequal length.

;;; Input:   One dimensional array.
;;; Returns: Normalized version of input.

(defun vector-mag-norm (vector)
  (let* ((result-array (make-array (array-length vector)))
         (vector-max -999999.0)
         (vector-min 999999.0)
         (diff 0.0)
         (scale 0.0)
         (mapmin 0.0)
         (mapmax 1.0)
         (length (array-dimension-n 1 vector)))
    (do ((i 0 (1+ i)))
        ((= i length))
      (cond ((< (aref vector i) vector-min) (setq vector-min (aref vector i)))
            ((> (aref vector i) vector-max) (setq vector-max (aref vector i)))))
    (setq diff (- mapmin vector-min))
    (setq scale (// mapmax (+ vector-max diff)))
    (do ((i 0 (1+ i)))
        ((= i length))
      (aset (* (+ (aref vector i) diff) scale) result-array i))
    result-array))
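
The mapping described for VECTOR-MAG-NORM is ordinary min-max scaling: subtract the smallest value, then divide by the range. A short Python sketch is given below for illustration. The name `min_max_normalize`, the `low` and `high` parameters, and the handling of a constant vector are assumptions that do not appear in the listing.

```python
def min_max_normalize(vector, low=0.0, high=1.0):
    """Map the smallest element to `low` and the largest to `high`.

    A constant vector has no range, so every element is mapped to `low`
    (an assumption made here to avoid dividing by zero).
    """
    v_min, v_max = min(vector), max(vector)
    span = v_max - v_min
    if span == 0:
        return [low for _ in vector]
    return [low + (high - low) * (x - v_min) / span for x in vector]
```

Unlike energy normalization, the result does not depend on how many samples the vector holds, which is the property the comment above appeals to.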


;;; "FREQUENCY-COMPRESS-LC"

;;; This function takes an array returned by (compute-att utt-name "Wide-Band Spectrum")
;;; which is a 256 by length array.  256 represents the frequency components
;;; of the utterance and length is proportional to time.  This function reduces
;;; the frequency resolution from 256 to 16.  This is a linear compression (LC).

;;; Input:  Two dimensional array returned by (compute-att utt-name "Wide-Band Spectrum")
;;; Output: Compressed version of input

(defun frequency-compress-lc (array)
  (let* ((row-length (array-dimension-n 2 array))
         (return-array (make-array (list 16 row-length)))
         (block-sum 0))
    (do* ((current-column 0 (1+ current-column)))
         ((= current-column row-length))
      (do* ((current-block 0 (1+ current-block)))
           ((= current-block 16))
        (setq block-sum 0)
        (do* ((current-element (* current-block 16) (1+ current-element)))
             ((= current-element (* (1+ current-block) 16)))
          (setq block-sum (+ block-sum (aref array current-element current-column))))
        (aset (// block-sum 16) return-array current-block current-column)))
    return-array))


;;; "FREQUENCY-COMPRESS-LFE"


_  -it 


* 


% 


«  w'*  >>  ,■»  _V 


;;;  This  function  takes  th#  array  returned  by  ( cosipute-ett  utt-name  "Wida-Band  Spectrum") 
;;;  which  is  256  by  lenghth  array.  The  256  discrete  frequency  components  will  be 
;;;  compressed  down  to  16.  This  compression  is  done  with  low  frequency  emphasise  ( LFC ) . 
;;;  It  is  not  a  logrithmic  compression.  Rather,  the  lower  112  frequency  components 
are  are  linearly  compressed  down  to  12,  and  the  higher  124  components  are 
;;;  linearly  compressed  down  to  4.  This  algorythm  is  written  so  as  to  make  changing 
the  emphasise  easy  if  desired. 

;;;  Input:  Two  dimensional  array  returnad  by  (compute-att  utt-name  "Wide-Band  Spectrum") 
;;;  Output:  Compressed  version  of  input 

(defun frequency-compress-lfe (array)
  (let* ((length (array-dimension-n 2 array))
         (return-array (make-array (list 16 length))))
    (do ((count 0 (1+ count)))
        ((= count length))
      (aset (column-average array count 0 10) return-array 0 count)
      (aset (column-average array count 11 21) return-array 1 count)
      (aset (column-average array count 22 32) return-array 2 count)
      (aset (column-average array count 33 43) return-array 3 count)
      (aset (column-average array count 44 54) return-array 4 count)
      (aset (column-average array count 55 65) return-array 5 count)
      (aset (column-average array count 66 76) return-array 6 count)
      (aset (column-average array count 77 87) return-array 7 count)
      (aset (column-average array count 88 98) return-array 8 count)
      (aset (column-average array count 99 109) return-array 9 count)
      (aset (column-average array count 110 120) return-array 10 count)
      (aset (column-average array count 121 131) return-array 11 count)
      (aset (column-average array count 132 162) return-array 12 count)
      (aset (column-average array count 163 193) return-array 13 count)
      (aset (column-average array count 194 224) return-array 14 count)
      (aset (column-average array count 225 255) return-array 15 count))
    return-array))
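
The band boundaries used by frequency-compress-lfe can also be held in a table, which is one way to keep the emphasis easy to change. The following is a minimal sketch in portable Common Lisp rather than the Zetalisp of this appendix. The names make-bands, *lfe-bands*, *linear-bands*, band-average, and compress-bands are hypothetical, and band-average is only assumed to behave like column-average (the mean of rows start through end, inclusive, in one column), whose definition is not reproduced here.

```lisp
;;; Sketch only: portable Common Lisp, not the original Zetalisp.
;;; All names below are hypothetical.

(defun make-bands (first-row width count)
  "Return COUNT inclusive (start end) row ranges of WIDTH rows each,
beginning at FIRST-ROW."
  (loop for i from 0 below count
        for start = (+ first-row (* i width))
        collect (list start (+ start width -1))))

(defparameter *lfe-bands*
  ;; 12 bands of 11 rows (rows 0-131), then 4 bands of 31 rows (rows 132-255).
  (append (make-bands 0 11 12) (make-bands 132 31 4)))

(defparameter *linear-bands*
  ;; 16 equal bands of 16 rows, the layout used by frequency-compress-lc.
  (make-bands 0 16 16))

(defun band-average (array column start end)
  "Floating-point mean of ARRAY[row, COLUMN] for START <= row <= END.
Assumed to mirror column-average; the listing's (// block-sum 16)
may instead truncate when the data are integers."
  (/ (loop for row from start to end
           sum (aref array row column))
     (float (1+ (- end start)))))

(defun compress-bands (array bands)
  "Average each band of rows in ARRAY, column by column."
  (let* ((columns (array-dimension array 1))
         (result (make-array (list (length bands) columns))))
    (dotimes (column columns result)
      (loop for (start end) in bands
            for band from 0
            do (setf (aref result band column)
                     (band-average array column start end))))))
```

Under these assumptions, (compress-bands spectrum *lfe-bands*) gives the 16-row low frequency emphasis result and (compress-bands spectrum *linear-bands*) gives the equal-width result, so a different emphasis only requires a different band table.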


;;; "menu-feature-set"

(defun menu-feature-set ()
  (let* ((item-list '("Wide Band Spectrum"
                      "Narrow Band Spectrum"
                      "LPC Spectrum"
                      "Formants"
                      "LPC, Formants, Fr. Freq."))
         (menu (tv:make-window 'tv:momentary-menu
                               ':label "Word-Search!
Select Feature Set to Use..."))
         (choice))
    (send menu ':set-item-list item-list)
    (setq choice (send menu ':choose))
    choice))


;;; "CREATE-READY-TEMPLATE-FILE"

;;; This is the function for creating a Ready-Template file
;;; (see word-searchi.doc). This is accomplished by
;;; reading each word of the vocabulary (digits "zero" thru "nine")
;;; one by one. Various SPIRE computations are performed, and the
;;; results are saved to a disk file. The user is prompted for both
;;; input and output pathnames.

;;; Input:  None (User is prompted for read and write pathnames)
;;; Output: Writes Ready-Template File to Disk

(defun create-ready-template-file ()
  (let* ((read-directory (string-append
                           "spl:>dawson>thesis>templates>"
                           (prompt-and-read :string
                                            "Please enter speaker name: ")
                           ">"))
         (read-path)
         (write-path (string-append
                       "spl:>dawson>thesis>templates>"
                       (prompt-and-read :string
                                        "Please enter Ready-Template name: ")))
         (choice nil))
    (setq *t-set* nil)
    (setq *tempath* read-directory)
    (setq choice (menu-feature-set))
    (loop for v-word in *vocabulary* do
      (setq read-path (string-append read-directory v-word ".utt"))
      (terpri)
      (princ "Processing ")
      (princ read-path)
      (princ "...")
      (setq *t-set* (append *t-set*
                            (list
                              (cond ((equal choice "Wide Band Spectrum")
                                     (process-utterance-wbs read-path))
                                    ((equal choice "Narrow Band Spectrum")
                                     (process-utterance-nbs read-path))
                                    ((equal choice "LPC Spectrum")
                                     (process-utterance-lpc read-path))
                                    ((equal choice "Formants")
                                     (process-utterance-formants read-path))
                                    ((equal choice "LPC, Formants, Fr. Freq.")
                                     (process-utterance-lpc-formants-ff read-path))))))
      (send (spire:utterance read-path) :kill))
    (dump-to-disk write-path (list *t-set* *tempath*)))
  (word-search!))


;;; "CREATE-READY-UTTERANCE-FILE"

;;; This is the function for creating a Ready-Utterance file
;;; (see word-searchi.doc). This is accomplished by
;;; reading a Digitized Continuous Utterance.
;;; Various SPIRE computations are performed, and the results are saved
;;; to a disk file. The user is prompted for both input and
;;; output pathnames.

;;; Input:  None (User is prompted for read and write pathnames)
;;; Output: Writes Ready-Utterance File to Disk

(defun create-ready-utterance-file ()
  (let* ((read-path
           (string-append
             "spl:>dawson>thesis>utterances>"
             (prompt-and-read :string "Name of Digitized Continuous Utterance :")
             ".utt"))
         (write-path
           (string-append
             "spl:>dawson>thesis>utterances>"
             (prompt-and-read :string "Name of Ready-Utterance ")))
         (choice nil))
    (setq choice (menu-feature-set))
    (setq *ready-utterance*
          (cond ((equal choice "Wide Band Spectrum")
                 (process-utterance-wbs read-path))
                ((equal choice "Narrow Band Spectrum")
                 (process-utterance-nbs read-path))
                ((equal choice "LPC Spectrum")
                 (process-utterance-lpc read-path))
                ((equal choice "Formants")
                 (process-utterance-formants read-path))
                ((equal choice "LPC, Formants, Fr. Freq.")
                 (process-utterance-lpc-formants-ff read-path))))
    (setq *uttpath* read-path)
    (send (spire:utterance read-path) :kill)
    (dump-to-disk write-path (list *ready-utterance* *weight-list* *uttpath*)))
  (word-search!))


;;; "LOAD-READY-TEMPLATE-FILE"

;;; This function loads a Ready-Template-File and setq's it to *t-set*

;;; Input:  None, user is prompted for Ready-Template Name
;;; Output: The global *t-set* is set to Ready-Template Name

(defun load-ready-template-file ()
  (let* ((read-path (string-append
                      "spl:>dawson>thesis>templates>"
                      (prompt-and-read :string "Name of Ready-Template : "))))
    (load read-path)
    (setq *t-set* (car *data*))
    (setq *tempath* (cadr *data*)))
  (word-search!))


;;; "LOAD-READY-UTTERANCE-FILE"

;;; This function loads a Ready-Utterance-File and setq's it to *ready-utterance*

;;; Input:  None, the user is prompted for Ready-Utterance Name
;;; Output: The global *ready-utterance* is setq'd to Ready Utterance Name

(defun load-ready-utterance-file ()
  (let* ((read-path (string-append
                      "spl:>dawson>thesis>utterances>"
                      (prompt-and-read :string "Name of Ready-Utterance : "))))
    (load read-path)
    (setq *ready-utterance* (car *data*))
    (setq *weight-list* (cadr *data*))
    (setq *uttpath* (caddr *data*)))
  (word-search!))


;;; "DISPLAY-WAVEFORM"

(defun display-waveform (x1 y1 x2 y2 pathname)
  (let* ((display-array (spire:att-val (send (spire:utterance pathname)
                                             :find-att "original waveform")))
         (length (array-length display-array))
         (width (- x2 x1))
         (height (- y2 y1)))
    (declare (sys:array-register display-array))
    (drawborder x1 y1 x2 y2)
    (loop for index1 fixnum from 0 to (- length 2)
          for index2 fixnum from 1 to (1- length) do
      (send tv:selected-window :draw-line
            (+ x1 (fix (* index1 (// width (float length)))))
            (+ y1 (fix (* (+ (aref display-array index1) 32767.0)
                          (// height 65535.0))))
            (+ x1 (fix (* index2 (// width (float length)))))
            (+ y1 (fix (* (+ (aref display-array index2) 32767.0)
                          (// height 65535.0))))))))
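
The coordinate arithmetic in display-waveform maps a sample index to a horizontal offset and a 16-bit sample value to a vertical offset inside the box from (x1, y1) to (x2, y2). The following is a minimal sketch of that mapping in portable Common Lisp, separated from the window system so it can be checked on its own. The function names sample-x, sample-y, and waveform-segments are hypothetical, and floor stands in for the Zetalisp fix on the assumption that fix rounds toward negative infinity.

```lisp
;;; Sketch only: portable Common Lisp; all names are hypothetical.

(defun sample-x (index n-samples x1 x2)
  "Horizontal pixel for sample INDEX out of N-SAMPLES in the box X1..X2."
  (+ x1 (floor (* index (/ (- x2 x1) (float n-samples))))))

(defun sample-y (sample y1 y2)
  "Vertical pixel for a 16-bit SAMPLE in the box Y1..Y2.
Adding 32767.0 shifts the signed sample to a non-negative offset,
and height/65535.0 scales the 16-bit range to the box height."
  (+ y1 (floor (* (+ sample 32767.0) (/ (- y2 y1) 65535.0)))))

(defun waveform-segments (samples x1 y1 x2 y2)
  "List of (xa ya xb yb) line segments joining consecutive SAMPLES.
Each segment corresponds to one :draw-line call in the listing."
  (let ((n-samples (length samples)))
    (loop for index from 0 to (- n-samples 2)
          collect (list (sample-x index n-samples x1 x2)
                        (sample-y (aref samples index) y1 y2)
                        (sample-x (1+ index) n-samples x1 x2)
                        (sample-y (aref samples (1+ index)) y1 y2)))))
```

The rotated display applies the same two scalings with the axes exchanged: the sample value is scaled by the box width and added to x1, and the sample index is scaled by the box height and subtracted from y2.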


;;; "display-waveform-rot"

(defun display-waveform-rot (x1 y1 x2 y2 pathname)
  (let* ((display-array (spire:att-val (send (spire:utterance pathname)
                                             :find-att "original waveform")))
         (length (array-length display-array))
         (width (- x2 x1))
         (height (- y2 y1)))
    (declare (sys:array-register display-array))
    (drawborder x1 y1 x2 y2)
    (loop for index1 fixnum from 0 to (- length 2)
          for index2 fixnum from 1 to (1- length) do
      (send tv:selected-window :draw-line
            (+ x1 (fix (* (+ (aref display-array index1) 32767.0)
                          (// width 65535.0))))
            (- y2 (fix (* index1 (// height (float length)))))
            (+ x1 (fix (* (+ (aref display-array index2) 32767.0)
                          (// width 65535.0))))
            (- y2 (fix (* index2 (// height (float length))))))))))


Appendix  C:  Sample  Results 


[Figure: RGD -- "28318" -- LPC Spectrum]

[Figure: SKIP/JONES -- "1555276" -- LPC Spectrum, Formants, F.F.]

[Figure: RGD/JONES -- "28318" -- LPC Spectrum, Formants, F.F.]



REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)

SECURITY CLASSIFICATION OF THIS PAGE: UNCLASSIFIED

1a.  Report Security Classification: UNCLASSIFIED
3.   Distribution/Availability of Report: Approved for public release; distribution unlimited
4.   Performing Organization Report Number(s): AFIT/GE/ENG/87D-14
6a.  Name of Performing Organization: School of Engineering
6b.  Office Symbol (if applicable): AFIT/ENG
6c.  Address (City, State, and ZIP Code): Air Force Institute of Technology, Wright-Patterson AFB, OH 45433
11.  Title (Include Security Classification): SPIRE Based Speaker-Independent Continuous Speech Recognition Using Mixed Feature Sets (UNCLASSIFIED)
12.  Personal Author(s): Dawson, Robert G., Captain, USAF
13a. Type of Report: MS Thesis
14.  Date of Report (Year, Month, Day): 1987 December
15.  Page Count: 125
17.  COSATI Codes: Field 17, Group 02
18.  Subject Terms: Speech Recognition, SPIRE, Dynamic Programming, Mixed Feature Sets
19.  Abstract: Thesis Chairman: Matthew Kabrisky, PhD, Professor of Electrical Engineering
20.  Distribution/Availability of Abstract: Unclassified/Unlimited
21.  Abstract Security Classification: UNCLASSIFIED
22a. Name of Responsible Individual: Dr. Matthew Kabrisky, Professor, OS-15
22b. Telephone (Include Area Code): (513) 255-5276

DD Form 1473, JUN 86. Previous editions are obsolete.


Continued  from  block  19:  Abstract 

A  system  was  developed  to  investigate  continuous  speech 
recognition.  The  system  incorporates  multiple  features  and  dynamic 
programming to recognize continuous inputs of the spoken digits (zero
through nine). The fundamental design concept extends from previous
successful recognition research efforts involving both isolated and
continuous speech using multiple feature sets, multiple template sets,
and dynamic programming. Among the features used in the investigation
are wide band spectrogram, narrow band spectrogram, linear predictive
coding (LPC) coefficients, LPC spectrum, frication frequency, and
formant  tracks.  An  advanced  speech  research  tool  called  SPIRE  provided 
the  computational  functions  needed  to  extract  the  raw  features.  The 
system  is  implemented  in  LISP  on  a  Symbolics  3600  series  LISP  machine. 


